Identity-preserving techniques for generative adversarial network projection

ABSTRACT

An improved system architecture uses a pipeline including an encoder and a Generative Adversarial Network (GAN) including a generator neural network to generate edited images with improved speed, realism, and identity preservation. The encoder produces an initial latent space representation of an input image by encoding the input image. The generator neural network generates an initial output image by processing the initial latent space representation of the input image. The system generates an optimized latent space representation of the input image using a loss minimization technique that minimizes a loss between the input image and the initial output image. The loss is based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image. The system outputs the optimized latent space representation of the input image for downstream use.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a non-provisional application of and claims the benefit of the filing date of U.S. Provisional Application 63/092,980, filed on Oct. 16, 2020, which is herein incorporated by reference in its entirety for all purposes.

This application is related to the concurrently filed applications titled “Multi-Scale Output Techniques for Generative Adversarial Networks” and “Techniques for Domain-to-Domain Projection Using a Generative Model,” which are herein incorporated by reference in their entirety for all purposes.

This application is also related to the concurrently filed applications titled “Direct Regression Encoder Architecture and Training” and “Supervised Learning Techniques for Encoder Training,” which are herein incorporated by reference in their entirety for all purposes.

TECHNICAL FIELD

This disclosure generally relates to image editing techniques. More specifically, but not by way of limitation, this disclosure describes an improved system architecture that uses a pipeline including an encoder and a Generative Adversarial Network (GAN) to generate edited images with improved speed, realism, and identity preservation.

BACKGROUND

Many image editing tools provide features that enable a user to edit or modify an image. Some of these tools even use machine learning-based techniques for editing images. However, the image editing capabilities of such existing tools are quite limited—the recreation of images is not accurate, the editing is limited to low-resolution images (e.g., 256×256 pixels) (i.e., large high-resolution images cannot be processed at all or cannot be processed in a reasonable time frame for the desired end result), unwanted artifacts and effects are introduced into the recreated images, and other deficiencies.

Some image editing tools use machine learning models such as Generative Adversarial Networks (GANs) to generate edited images. While GANs have been successful in generating high quality edited images, existing techniques using GANs still have several shortcomings. For example, some systems use an optimization process to generate an editable representation of an image. Generally, the optimization process can take several minutes and thus real-time results cannot be provided. Further, in prior systems, the image generated tends to diverge from the original. This divergence can take multiple forms and can impact multiple features of the content of the input image (e.g., for an image of a face being edited, in the edited generated image, the teeth or nose look different than in the original image). The techniques described herein address these problems and others.

SUMMARY

The present disclosure describes techniques for editing images to efficiently generate realistic and accurate edited images. More particularly, new and improved techniques are described for using a pipeline including an encoder and a generative adversarial network to project images into the latent space of the GAN with improved speed, realism, and identity preservation.

In some embodiments, a computer-implemented method includes producing an initial latent space representation of an input image by encoding the input image; generating, by a generator neural network, an initial output image by processing the initial latent space representation of the input image; generating an optimized latent space representation of the input image using a loss minimization technique that minimizes a loss between the input image and the initial output image, wherein the loss is based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image; and outputting the optimized latent space representation of the input image for downstream use.

In some aspects, the method further includes downsampling the input image before generating the initial latent space representation of the input image. In some aspects, the method further includes computing the loss by downsampling the initial output image; passing the downsampled initial output image as input to a convolutional neural network and extracting the initial perceptual features as output from a subset of layers of the convolutional neural network; passing the downsampled input image as input to the convolutional neural network and extracting the target perceptual features from the subset of the layers of the convolutional neural network; and computing the loss based upon the target perceptual features and the initial perceptual features. In some aspects, the convolutional neural network is a Visual Geometry Group (VGG) network, and wherein the subset of the layers includes a conv1_1 layer, a conv1_2 layer, a conv3_1 layer, and a conv4_1 layer of the VGG network.

In some aspects, the loss is further based on one or more of: a comparison of pixels of the input image and pixels of the initial output image or a comparison of the initial latent space representation and a target latent code. In some aspects, the downstream use includes one or more of applying user-configured edits to the latent space representation of the input image or generating an output image, by the generator neural network, by processing the optimized latent space representation, wherein the output image is perceptually similar to the input image.

In some aspects, producing the initial latent space representation, optimizing the initial latent space representation, and generating the output image that is perceptually similar to the input image are performed in less than about 10 seconds. In some aspects, the output image has a resolution of about 1024×1024 pixels. In some aspects, the method further includes outputting the output image for display on a computing device.

In some embodiments, a computing system includes a processor and a non-transitory computer-readable medium comprising instructions which, when executed by the processor, perform processing including producing an initial latent space representation of an input image by encoding the input image; generating, by a generator neural network, an initial output image by processing the initial latent space representation of the input image; generating an optimized latent space representation of the input image using a loss minimization technique that minimizes a loss between the input image and the initial output image, wherein the loss is based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image; and outputting the optimized latent space representation of the input image for downstream use.

In some embodiments, a non-transitory computer-readable medium has instructions stored thereon, the instructions executable by a processing device to perform operations including producing an initial latent space representation of an input image by encoding the input image; a step for generating an optimized latent space representation of the input image based on target perceptual features extracted from the input image and initial perceptual features extracted from an initial output image generated from the initial latent space representation; and outputting the optimized latent space representation of the input image for downstream use.

Various embodiments are described herein, including methods, systems, non-transitory computer-readable storage media storing programs, code, or instructions executable by one or more processors, and the like.

These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

Features, embodiments, and advantages of the present disclosure are better understood when the following Detailed Description is read with reference to the accompanying drawings.

FIG. 1 depicts an example of a computing environment for image editing according to certain embodiments of the present disclosure.

FIG. 2 depicts an example of a projection pipeline according to certain embodiments of the present disclosure.

FIG. 3A depicts an example of a process for projecting an image into the latent space of a GAN with improved efficiency and identity preservation according to certain embodiments of the present disclosure.

FIG. 3B depicts an example of a process for computing a loss as used in the process of FIG. 3A according to certain embodiments of the present disclosure.

FIG. 4 depicts examples of images generated with edits using the techniques of FIGS. 3A-3B according to certain embodiments of the present disclosure.

FIG. 5 depicts additional examples of images generated with edits using the techniques of FIGS. 3A-3B according to certain embodiments of the present disclosure.

FIG. 6 depicts an example of a process for generating multi-resolution outputs from a GAN according to certain embodiments of the present disclosure.

FIG. 7 depicts a schematic diagram illustrating the multi-resolution output process of FIG. 6 according to certain embodiments of the present disclosure.

FIG. 8 depicts another schematic diagram illustrating the multi-resolution output process of FIG. 6 according to certain embodiments of the present disclosure.

FIG. 9 depicts examples of generated images using the techniques of FIG. 6, according to certain embodiments of the present disclosure.

FIG. 10 depicts additional examples of generated images using the techniques of FIG. 6, according to certain embodiments of the present disclosure.

FIG. 11 depicts an example of a process for domain-to-domain projection according to certain embodiments of the present disclosure.

FIG. 12 depicts examples of images illustrating using a collage to generate a realistic image using the techniques of FIG. 11 according to certain embodiments of the present disclosure.

FIG. 13 depicts examples of images illustrating using a sketch to generate a realistic image using the techniques of FIG. 11 according to certain embodiments of the present disclosure.

FIG. 14 depicts examples of images illustrating using a three-dimensional (3D) drawing to generate a realistic image using the techniques of FIG. 11 according to certain embodiments of the present disclosure.

FIG. 15 depicts an example of a computing system that performs certain operations described herein according to certain embodiments of the present disclosure.

FIG. 16 depicts an example of a cloud computing environment that performs certain operations described herein according to certain embodiments of the present disclosure.

DETAILED DESCRIPTION

In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of certain embodiments. However, it will be apparent that various embodiments may be practiced without these specific details. The figures and description are not intended to be restrictive. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs.

In certain embodiments, the disclosed techniques include new and improved machine learning-based techniques such as using a generator neural network (e.g., part of a GAN) to efficiently generate realistic and accurate images. To edit images with a generator neural network, a latent space representation z is discovered such that the image G(z) generated by the generator neural network is similar to a user-specified image x. This process of discovering a latent space representation corresponding to a user-specified image is called projection. The latent space may, for example, be a hypersphere made up of variables drawn from a Gaussian distribution. In a training process, the generator neural network learns to map points in the latent space to specific output images. Such interpretation by the generator neural network gives structure to the latent space, which varies according to the generator used. For a given generator neural network, the latent space structure can be analyzed and traversed to control image generation.
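
For illustration only, the following sketch shows the projection objective in a simplified form: starting from a random latent code and minimizing a pixel-wise distance between G(z) and the target image x. It assumes PyTorch and a pretrained `generator` callable as hypothetical stand-ins; the embodiments described below instead initialize from an encoder and use perceptual losses.

    import torch

    def project(generator, target_image, latent_dim=512, steps=500, lr=0.01):
        """Search for a latent code z such that generator(z) approximates target_image.

        generator: callable mapping a (1, latent_dim) tensor to an image tensor.
        target_image: tensor with the same shape as the generator's output.
        """
        z = torch.randn(1, latent_dim, requires_grad=True)  # random initialization
        optimizer = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            generated = generator(z)
            # Simple pixel-wise objective; the disclosed techniques use richer losses.
            loss = torch.nn.functional.mse_loss(generated, target_image)
            loss.backward()
            optimizer.step()
        return z.detach()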

As noted above, various machine learning models are popularly used to generate and edit realistic images. In particular, GANs can be used to generate an image, either randomly or based on a real image. In existing systems, there exists a trade-off between speed and accuracy. With conventional systems, at best it takes several minutes to generate an image that looks realistic and replicates the original image. Generally, those systems that can deliver faster results do so with reduced accuracy and/or resolution. For a compelling user experience, the projection process should not only discover a latent space representation which accurately reconstructs a user-specified image, but it also should be efficiently computed within several seconds. Thus, a major problem is finding a projection process that is efficient and accurate. Prior techniques suffer from one or more of the following:

-   Inefficient. The projection should be done in seconds for a compelling user experience, whereas high-resolution projection typically takes about 5 minutes.
-   Does not maintain identity. For example, when projecting an image of a person's face, the person's identity will change, making the output unusable for editing.
-   Low-resolution images are produced.
-   Require noise maps, which cannot be cheaply transmitted across networks in large-scale products.
-   Require retraining the generative model.

The present disclosure describes techniques for image generation and editing that address the above-noted deficiencies. In some aspects, a latent space representation of an input image is optimized both quickly and with high resolution while providing accurate results including identity preservation. This latent space representation of the input image may be edited (e.g., editing a face image to make the person depicted appear to smile or wear glasses). The edited latent space representation is processed using a generator neural network to generate an image that replicates the input image with improved speed, realism, and identity preservation. In some embodiments, an input image is processed by a pipeline of an image editing system including an encoder and a generator. The encoder processes the input image to produce a latent space representation of the input image. The latent space representation of the input image is optimized by minimizing a loss based on perceptual features extracted from the input image and perceptual features extracted from an image generated from the initial latent space representation of the input image. In alternative or additional embodiments, a discriminator loss component is added to the loss to constrain the output image towards a particular image domain or style (e.g., to edit an input cartoon image to appear like a photorealistic image). In alternative or additional embodiments, the generator neural network is modified with auxiliary networks that produce rapid preview images.

The following non-limiting examples are provided to introduce certain embodiments. In these examples, an image editing system projects an image into the latent space of a GAN, resulting in a latent space representation (e.g., an N-dimensional vector or matrix representation) of the image. This latent space representation can be edited (e.g., using vector addition or other techniques). When the edited latent space representation is processed with the GAN to generate an output image, the edits are reflected in the output image. For example, an image of a human face can be edited so that the face appears to smile, look older or younger, turn the head to a different angle, and so forth.

In a first example, the image editing system applies techniques for generating an image based on an optimized latent space representation of an input image while maintaining speed, resolution, and similarity to the input image. First, the image editing system obtains an input image. For example, a user uploads an image to image editing software. The image editing system produces an initial latent space representation of the input image by encoding the input image. For example, the input image (optionally downsampled) is processed by an encoder neural network trained to generate a latent space representation of an input image.

The initial latent space representation is processed with a generator neural network to generate an initial output image. The initial latent space representation is provided as input to a generator neural network, which has been pretrained to generate images from latent space representations of images. This results in an initial output image. Due to the nature of the initial latent space representation of the input image, this initial latent space representation, when used to generate an output image, may produce an output image that does not look adequately similar to the input image. Accordingly, the initial latent space representation is then optimized.

To optimize the latent space representation, the image editing system applies a loss minimization technique that minimizes a loss between the input image and the initial output image. The image editing system computes a loss based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image. Perceptual features are visually representable properties of objects. Examples of perceptual features include size, shape, color, position, facial expression, and so forth. To extract perceptual features, for example, the image editing system provides the images as input to a convolutional neural network trained to classify images, and extracts features from selected layers of the convolutional neural network. The output of these layers can be correlated to different perceptual features in an image. These perceptual features are compared, for the input image and the initial output image, to compute the loss.

The image editing system minimizes the loss to generate an optimized latent space representation of the input image. The image editing system adjusts the latent space representation to minimize the loss. This may be performed iteratively, e.g., by generating updated images using the updated latent space representations, extracting perceptual features from the updated images, and recomputing the loss function, which is then used to adjust the latent space representation repeatedly until convergence.
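
A minimal sketch of this iterative refinement is shown below, assuming PyTorch; `encoder`, `generator`, and `extract_features` are hypothetical stand-ins for the encoder, generator neural network, and perceptual feature extractor described above.

    import torch

    def optimize_latent(encoder, generator, extract_features, input_image,
                        steps=100, lr=0.05):
        """Refine the encoder's latent code so that perceptual features of the
        generated image match those of the input image."""
        with torch.no_grad():
            w = encoder(input_image)                      # initial latent code
            target_feats = extract_features(input_image)  # fixed target features
        w = w.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([w], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            output_image = generator(w)
            feats = extract_features(output_image)
            # Perceptual loss: compare features layer by layer.
            loss = sum(torch.nn.functional.l1_loss(f, t)
                       for f, t in zip(feats, target_feats))
            loss.backward()
            optimizer.step()
        return w.detach()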

Once the latent space representation of the input image is optimized, the image editing system outputs the optimized latent space representation of the input image for downstream use. The downstream use may include editing the latent space representation (e.g., so that the output image will look different, such as a face looking older or a couch having a different shape). Alternatively, or additionally, the downstream use may include processing the optimized latent space representation with the generator neural network to generate an output image that is perceptually similar to the input image. This process can be used to project and generate an output image that is perceptually similar to the input image in less than ten seconds.

In another example, the image editing system generates preview images using a modified generator neural network. The image editing system produces a latent space representation of an input image. For example, the image editing system uses an encoder to generate the latent space representation, as described above with respect to the first example. The image editing system generates a first output image at a first resolution by providing the latent space representation of the input image as input to a generator neural network. The generator neural network includes an input layer, an output layer, and multiple intermediate layers. The first output image is taken from one of the intermediate layers. In some implementations, the generator neural network is augmented with an auxiliary neural network trained to generate the first output image from the intermediate layer.

The image editing system generates a second output image at a second resolution different from the first resolution by providing the latent space representation of the input image as input to the generator neural network and taking the second output image from the output layer of the generator neural network. This generates another, higher resolution output image.

In some implementations, the first output image is used as a preview image (e.g., for display on a user interface while further processing is performed). Such a preview image can be generated quickly (e.g., in a few seconds), as the image need not be processed by the whole generator neural network. Although the preview image is lower resolution than the final output, the preview image is an accurate representation of that ultimate output. Such use of a specialized neural network to generate preview images is particularly useful for image editing software when a very high resolution image is being generated that can take 8 or more seconds or even minutes to optimize, as the preview image can be generated in less than five seconds and shown to the user during processing.

In another example, the image editing system uses an optimization technique to modify a latent space representation of an input image in a first domain, such that the ultimate output image is in a second, target domain. The domains correspond to categories or styles of images. For example, the first domain is cartoons of people and the second domain is photorealistic images of people. A cartoon image of a person is used to generate a photorealistic image of a person that looks similar to the cartoon image. As another example, the first domain is a photograph of a landscape and the second domain is a painting of a landscape. A photograph of a landscape is used to generate an image in the style of a landscape painting that looks similar to the landscape photograph.

The image editing system uses a pipeline including an encoder and a GAN comprising a generator neural network and a discriminator neural network. The image editing system obtains a first image in a first domain (e.g., a photograph of a person, a sketch, a collage, and so forth). For the purposes of this example, the input image is a sketch of a face (e.g., the first domain is “sketch”) and the target domain is “photorealistic image.” In this case, the objective is to enforce realism in the latent space representation of the input image. This is accomplished using a GAN which has been pretrained to generate photorealistic images of faces. Such a GAN includes a generator neural network that was trained to generate photorealistic images of faces and a discriminator neural network that was trained to recognize whether or not an image is a photorealistic image of a face (e.g., as opposed to a computer-generated image of a face).

The image editing system produces an initial latent space representation of the input image by encoding the input image, as described above with respect to the first example. Similarly to the first example, the image editing system minimizes a loss to update the initial latent space representation. In this case, the loss is based on output of the discriminator. Since the discriminator is trained to recognize whether an image is in a particular domain (e.g., that of photorealistic images), a score generated by the discriminator is used to guide the latent space representation toward the target domain.

The image editing system identifies information about a target domain. For example, a target latent code is selected according to user preference and/or by selecting the mean latent code from GAN training. The target latent code is provided as input to the generator neural network, which outputs a target image. The target image is then processed by the discriminator neural network to compute a target output of the discriminator neural network.

The image editing system generates an initial output image by processing the initial latent space representation of the input image with the generator neural network. This initial output image is provided as input to the discriminator neural network. The discriminator neural network outputs a score indicating whether the initial output image is in the target domain. For example, a discriminator trained on digital photographs of human faces may output a score such as 1 or 100 if the image looks exactly like a photograph of a human face, and a lower score such as 0 or 50 if the image does not look like a photograph of a human face or looks only somewhat like a photograph of a human face.

The image editing system computes a loss based on the computed score. The loss may be based on the target discriminator output, the computed score, and possibly other loss components, such as the perceptual loss described above with respect to the first example. The image editing system minimizes the loss to compute an updated latent space representation of the input image. Since the discriminator was trained to evaluate whether a generated image looks like a photorealistic image of a human face, minimizing the discriminator loss constrains the latent space representation towards the domain of photorealistic images of human faces.
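
One possible formulation of such a discriminator-based loss term is sketched below, assuming PyTorch; `generator`, `discriminator`, and `w_target` are hypothetical stand-ins for the pretrained GAN components and the target latent code described above, not the specific implementation of this disclosure.

    import torch

    @torch.no_grad()
    def target_discriminator_output(generator, discriminator, w_target):
        """Score the image generated from a target latent code (e.g., the mean
        latent code from GAN training)."""
        return discriminator(generator(w_target))

    def discriminator_loss(discriminator, generated_image, target_output):
        """Penalize latent codes whose generated image the discriminator scores
        differently than the target-domain image."""
        score = discriminator(generated_image)
        return torch.nn.functional.mse_loss(score, target_output)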

Upon computing the updated latent space representation, the image editing system processes the optimized latent space representation with the generator neural network to generate an output image that is in the target domain of photorealistic images of faces. Although this example relates to the domain of realistic face images, these techniques are suitable for a wide range of applications, such as converting a photograph of a dog to a cartoon, converting an image of a sculpture of a person to a drawing of a person, and so forth.

Accordingly, as described herein, certain embodiments provide improvements to computing environments by solving problems that are specific to computer-implemented image editing environments. These improvements include projecting an image into the latent space with improved speed, resolution, and resemblance to the input image. Further improvements can be provided, alternatively or additionally, by modifying the generator neural network to quickly output one or more preview images via an auxiliary neural network. Further improvements can be provided, alternatively or additionally, by minimizing a loss based on a discriminator output to project an image from one domain to another. Together or separately, these techniques significantly improve the results and user experience of GAN projection.

Example of an Operating Environment for Image Projection and Editing

FIG. 1 depicts an example of a computing environment 100 including an image editing system 102 that provides capabilities for editing electronic content such as digital photos and images. For example, as depicted in FIG. 1, the image editing system 102 may receive as inputs an input image 106 that is to be edited and one or more edits to be made to the input image 106. The image editing system 102 is configured to edit the input image 106 per the edits and generate an output image 150 that is an edited representation of the input image 106 and incorporates the edits.

There are various ways in which the input image 106 and the edits to be made are input to the image editing system 102. In the example depicted in FIG. 1, the image editing system 102 may provide an editor interface 104 that a user may use to provide inputs regarding the input image 106 to be edited and the one or more edits (e.g., edit parameters 108) to be made to the input image 106. The image editing system 102 then generates an edited output image 150 by applying the user-provided edits to the input image 106. In certain embodiments, the edited output image 150 may be presented or output to the user using the editor interface 104.

In some embodiments, the editor interface 104 may include one or more Graphical User Interfaces (GUIs) that enable a user to provide inputs identifying the input images, identifying the edits to be made, setting configuration parameters for the image editing system 102, and the like. For example, a GUI may include one or more user-selectable elements that enable a user to input images 106 to be edited. One or more GUIs provided by the editor interface 104 may include one or more upload elements for uploading content (e.g., an upload field to upload an image to be edited). In some implementations, the editor interface 104 responds to user selection of an upload element by transitioning to a view showing available files to upload, prompting a user to take a photo, or the like.

One or more GUIs provided by the editor interface 104 may also include user-selectable elements that enable a user to specify the edits or modifications to be performed. For example, a GUI may display one or more sliders that can be manipulated by the user, each slider corresponding to an attribute of the image to be edited. Other elements provided by the GUIs may include text entry fields, buttons, pull-down menus, and other user-selectable options. In certain implementations, the editor interface 104 may be part of content editing software such as Adobe Photoshop®, which is capable of receiving and editing digital content (e.g., digital photographs or other images).

In some embodiments, the image editing system 102 and the editor interface 104 execute on a computing device, which may be used by a user. Examples of a computing device include, but are not limited to, a personal computer, a tablet computer, a desktop computer, a processing unit, any combination of these devices, or any other suitable device having one or more processors. In some other embodiments, the image editing system 102 and the editor interface 104 may operate on different computing systems, which may be communicatively coupled to each other. Examples of computer platforms and implementations that may be used to implement the image editing system 102 are depicted in FIGS. 15 and 16 and described below.

The image editing system 102 may include multiple subsystems, which work in cooperation to generate edited output images 150. In the embodiment depicted in FIG. 1, the image editing system 102 includes a projection subsystem 110, a training subsystem 140, an edit management subsystem 120, and an image generation subsystem 130. Computing environment 100 depicted in FIG. 1 is merely an example and is not intended to unduly limit the scope of claimed embodiments. Many variations, alternatives, and modifications are possible. For example, in some implementations, the image editing system 102 may have more or fewer subsystems than those shown in FIG. 1, may combine two or more subsystems, or may have a different configuration or arrangement of subsystems. The various systems, subsystems, and other components depicted in FIG. 1 may be implemented using software only (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, using hardware only, or combinations thereof. The software may be stored on a non-transitory storage medium (e.g., on a memory device).

The various subsystems of the image editing system 102 can be implemented in the same computing system or different, independently operated computing systems. For example, the edit management subsystem 120 could be a separate entity from the projection subsystem 110, the image generation subsystem 130, and the training subsystem 140, or the same entity. The image editing system 102 may execute on a server separately from the editor interface 104, or other embodiments can involve the image editing system 102 being built into a software application executing the editor interface 104 on a user device.

One or more of the subsystems of the image editing system 102 include trained machine learning models or include components that use machine learning models that have been trained. For example, in the embodiment depicted in FIG. 1, the training may be performed by a training subsystem 140, which may perform the training using various training data 142. In some implementations, the training subsystem 140 includes, or is communicatively coupled to, one or more data storage units 141 for storing the training data 142.

An edit management subsystem 120 configures edits to the input image 106 using an edit configurer 122 and a feedback generator 124. A projection subsystem 110 generates a latent space representation 113 representing the input image 106. A latent code transformer 114 generates a modified latent space representation 117 by applying one or more transformations, including the edits configured by the edit management subsystem 120, to the latent space representation 113 of the input image. An image generation subsystem 130 includes a generator 132 that generates an image according to the transformed latent space representation 117. In some aspects, the image generation subsystem 130 further includes a postprocessor 134 that performs postprocessing of the generated image 139 to produce the output image 150, which may be returned to the editor interface 104. In some embodiments, the training subsystem 140 trains one or more components of the latent code transformer 114 using the training data 142. In some implementations, the training subsystem 140 trains the generator 132 using a discriminator 136. In some implementations, the training subsystem 140 trains the encoder 112 and/or components of the latent code transformer 114 using one or more loss functions 144.

The edit management subsystem 120 includes hardware and/or software configured to control image edits. The edit management subsystem 120 includes an edit configurer 122 and a feedback generator 124. The edit configurer 122 receives edit parameters 108, e.g., editor-configured modification instructions, from the editor interface 104. For example, edit parameters 108 may specify that an image of a person should be modified to include red hair and glasses. The edit configurer 122 transmits an indication of the edit parameters 108 to the latent code transformer 114 of the projection subsystem 110 for further processing.

The feedback generator 124 prepares and transmits edit feedback 128 to the editor interface 104. Examples of such edit feedback 128 include metrics showing how much an attribute is being modified (e.g., numerical values showing the selected edit parameters 108). Alternatively, or additionally, the edit feedback 128 includes preview images showing how the output image will appear given the current edit parameters. In some embodiments, the feedback generator 124 receives reduced-resolution preview images 135 from auxiliary networks 133A, 133B of the GAN 138, as described herein. The feedback generator 124 uses the reduced-resolution preview images 135 to provide a quick preview image to the editor interface 104.

The projection subsystem 110 includes hardware and/or software configured to identify and transform latent space representations of images. The projection subsystem 110 receives as input the input image 106 and generates as output a modified latent space representation 117 of the input image, which is a string of numbers (e.g., a vector) reflecting edits to be applied to the input image 106.

In some implementations, the projection subsystem 110 includes an encoder 112 configured to receive an input image 106, project the input image 106 into a latent space representation 113, and output the latent space representation 113. The projection subsystem 110 further includes a latent code transformer 114 for performing transformations and other modifications to the latent space representation 113 to generate a modified latent space representation 117.

In some implementations, the encoder 112 is a machine learning model that has been trained to discover a latent space representation of the input image 106. The latent space representation (also referred to as semantic latent code or latent code) is a string of numbers (e.g., an n-dimensional vector containing a value for each of the n dimensions) that, when provided as input to the generator, creates a particular image (e.g., to replicate the input image 106). The encoder 112 is a machine learning model trained to generate such a latent space representation. The encoder 112 may, for example, be a feed-forward network trained to encode the input image 106. Given an input image 106 and a generator 132, the encoder discovers a latent space representation z of the input image, such that when the latent space representation z is input to the generator 132, the resulting generated image 139 perceptually resembles the target input image 106.

The latent code transformer 114 includes functionality to optimize, transform, and/or edit the latent space representation 113 and/or an initial latent code to generate the modified latent space representation 117. Such transformations may include modifications received from the edit management subsystem 120. Alternatively, or additionally, the transformations include mappings to make the latent code more easily editable or more easily digestible by the generator 132. The transformations further include an optimization process performed by the optimizer 114A to increase the similarity between the latent space representation and the original input image 106. The latent code transformer 114 outputs the transformed latent space representation 117 to the generator 132 for further processing. In some aspects, the latent code transformer 114 includes an optimizer 114A, a mapper/augmenter 114B, and a latent code editor 114C.

The optimizer 114A includes functionality to optimize the latent space representation of an input image. In some aspects, the optimizer 114A takes an initial latent space representation and optimizes the latent space representation according to one or more loss functions. The loss is minimized until the transformed latent space representation 117 is perceptually similar to the input image 106 to a desired degree. In some implementations, the loss function further includes components for controlling qualities of the latent space representation, such as a realism constraint. The optimizer 114A can use a combination of loss components including a pixel loss 115A, perceptual loss 115B, latent loss 115C, and discriminator loss 115D to optimize and/or control the latent space representation, as described herein.

The pixel loss 115A is a function of pixels of the input image and pixels of an image generated from the initial latent space representation. Minimizing the pixel loss 115A steers the latent space representation to produce images similar to the input image on a pixel-by-pixel basis. The perceptual loss 115B is a function of perceptual features extracted from the input image and perceptual features of an image generated from the initial latent space representation. Minimizing the perceptual loss 115B steers the latent space representation to produce images similar to the input image according to high-level or low-level perceptual features. For example, different layers of a convolutional neural network can be used to extract high-level or low-level features for comparison.
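
As an illustrative sketch (assuming PyTorch tensors, not a definitive implementation), the pixel loss 115A can be expressed as a direct distance between images; the perceptual loss 115B follows the same pattern applied to feature maps extracted from selected network layers, as sketched in the projection pipeline discussion below.

    import torch.nn.functional as F

    def pixel_loss(generated_image, input_image):
        """Pixel-by-pixel comparison of the generated image and the input image."""
        return F.mse_loss(generated_image, input_image)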

The latent loss 115C is a function of a latent space representation of the input image and a target latent code. Minimizing the latent loss 115C can be used to steer the latent space representation towards greater similarity with the input image. The discriminator loss 115D is a function of a discriminator output generated using the latent space representation of the input image and a target discriminator output. Minimizing the discriminator loss 115D can be used to steer the latent space representation to produce images in the domain in which the discriminator was trained (e.g., to enforce realism or change a photo to a sketch, as described herein).
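
Similarly, the latent loss 115C and an overall combination of loss components can be sketched as follows (a hypothetical illustration assuming PyTorch; the weights shown in the example call are placeholders, not values from this disclosure):

    import torch.nn.functional as F

    def latent_loss(w, w_target):
        """Distance between the current latent code and a target latent code."""
        return F.mse_loss(w, w_target)

    def combined_loss(loss_terms, weights):
        """Weighted sum of individually computed loss components, e.g.
        combined_loss([pixel, perceptual, latent, discriminator], [1.0, 1.0, 0.1, 0.1])."""
        return sum(weight * term for weight, term in zip(weights, loss_terms))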

The mapper/augmenter 114B includes functionality to map the latent space representation 113 from one latent space to another. For example, the encoder 112 generates a latent code in a first space, Z space, and the mapper/augmenter 114B applies a mapping to transform the latent code from the Z space to a second space, W space. This mapping is executed in some implementations to facilitate image editing by transforming the latent space such that movement in the latent space smoothly correlates with changes to one or more target attributes. As an example, in the W space, incrementing the latent variable in a particular direction continuously makes hair color lighter in an image while maintaining the overall look of the image. In the Z space, such smooth changes with direction in the latent space are not always possible, as the Z space is more “entangled.” W space transformation techniques and advantages are described in Karras et al., “A Style-Based Generator Architecture for Generative Adversarial Networks,” https://arxiv.org/pdf/1812.04948.pdf (2019) (“StyleGAN”) and Shen et al., “InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs,” arXiv:2005.09635 (2020).

In some implementations, the mapper/augmenter 114B further includes functionality to augment the latent space representation 113 from one dimensionality to another (e.g., to an extended latent space, also referred to as “W-plus” or “W_(p)” space). For example, the mapper/augmenter 114B transforms the W space latent code, which is 512 dimensions, to a W_(p) space latent code, which is 512×18 dimensions. This facilitates image editing based on continuous properties of the latent space. W_(p) space transformation techniques and advantages are described in Abdal et al., “Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?,” arXiv:1904.03189 (2019).
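
One common way to form such an extended code, shown here as an assumption rather than the specific mapping used by the mapper/augmenter 114B, is to repeat the 512-dimensional W code once per generator style layer:

    import torch

    def w_to_w_plus(w, num_layers=18):
        """Broadcast a (batch, 512) W-space code to a (batch, num_layers, 512)
        W_(p)-space code by repeating it once per style layer; each copy can then
        be optimized or edited independently."""
        return w.unsqueeze(1).repeat(1, num_layers, 1)

    # Example: a single 512-dimensional latent code becomes an 18x512 code.
    w = torch.randn(1, 512)
    assert w_to_w_plus(w).shape == (1, 18, 512)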

The latent code editor 114C applies changes to the latent space representation 113 (e.g., after optimization performed by the optimizer 114A and any mappings or augmentations performed by the mapper/augmenter 114B), based upon edit parameters received from the edit configurer. For example, the latent code editor 114C applies linear and/or nonlinear modifications to the latent space representation based on training indicating that these modifications will cause a desired change in the ultimate output image (e.g., to make a person depicted in an image appear to smile, be older, etc.).

Thus, the latent space representation 113 generated by the encoder 112 is processed by one or more components of the latent code transformer 114 to generate the modified latent space representation 117, which is passed to the image generation subsystem 130 for further processing.

In some embodiments, the image generation subsystem 130 includes hardware and/or software configured to generate an output image 150 based on input code (e.g., the modified latent space representation 117). The image generation subsystem includes a generator 132 and a postprocessor 134.

The generator 132 includes a machine learning model which has been trained to generate a generated image 139 based on input latent code. In some implementations, the generator 132 is a neural network. The generator 132 is pretrained to generate data that is similar to a training set. Depending on the type of image to be edited by the image editing system 102, the generator may be trained to generate an image of a human face, a landscape, a dog, a cat, a shoe, and so forth. In some aspects, the generator 132 is trained to generate a specific type of image, as such targeted training can produce very realistic results. The generator 132 can produce a random new image (e.g., of a person that does not exist) based on random input (e.g., from a normal or Gaussian distribution). The generator can produce a new image that looks like an input image 106 using the techniques described herein and an input latent space representation of an image that is generated based on the input image 106. In some implementations, the generator 132 is part of a Generative Adversarial Network (GAN) 138 and is trained in a zero-sum game with the discriminator 136.

In some embodiments, the generator 132 is attached to one or more auxiliary networks 133A, 133B. Although two auxiliary networks 133A and 133B are pictured, more or fewer auxiliary networks may be implemented. The auxiliary networks 133A and 133B are neural networks attached to selected layers of the generator 132. The auxiliary networks 133A and 133B are trained to output a reduced-resolution version of the ultimate GAN output 139 using intermediate feature vectors extracted from the intermediate layers of the generator 132. These reduced-resolution preview images 135 are transmitted to the feedback generator 124 for further processing.
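
For illustration, an auxiliary network of this kind could be a small convolutional head that maps intermediate generator features to a low-resolution RGB image; the sketch below is a hypothetical PyTorch example, not the specific architecture of auxiliary networks 133A and 133B.

    import torch
    import torch.nn as nn

    class PreviewHead(nn.Module):
        """Maps intermediate generator feature maps to a low-resolution RGB preview."""

        def __init__(self, in_channels):
            super().__init__()
            self.to_rgb = nn.Sequential(
                nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2),
                nn.Conv2d(64, 3, kernel_size=1),
                nn.Tanh(),
            )

        def forward(self, features):
            return self.to_rgb(features)

    # Example: features from a 64x64 intermediate layer yield a 64x64 preview image.
    features = torch.randn(1, 256, 64, 64)
    assert PreviewHead(256)(features).shape == (1, 3, 64, 64)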

In some embodiments, the postprocessor 134 ingests the generated image 139 and performs processing to prepare the output image 150. In some aspects, the projection subsystem 110 projects a portion of the input image 106 (e.g., a cropped region such as a face or a flower from within a larger image). In such cases, the generated image 139 is a subset of the input image 106, and the postprocessor 134 integrates the generated image 139 into the remaining portion of the input image 106 to generate the output image 150. Other postprocessing performed by the postprocessor 134 may include smoothing portions of the generated image 139, increasing or decreasing the pixel size of the generated image 139, and/or combining multiple generated images 139.

The training subsystem 140 includes hardware and/or software configured to train one or more machine learning models as used by the image editing system 102. The training subsystem 140 includes a discriminator 136. The discriminator 136 is part of the GAN 138 including the generator 132, and evaluates the output of the generator 132 to train the generator 132. The discriminator 136 compares images produced by the generator 132 to target images (e.g., digital photographs, drawings, or the like). The discriminator 136 generates a score based on the comparison. For example, if the GAN 138 is trained on digital photographs, the score generated by the discriminator 136 indicates whether the discriminator has determined that an image generated by the generator is likely to be a real photograph or a computer-generated copy. The generator 132 works to “trick” the discriminator into determining that a generated image is actually a target image such as a real photograph. Such a competition between the discriminator 136 and the generator 132 can be used to teach the generator to produce extremely realistic images.

The training subsystem 140 further includes functionality to train the encoder 112, including one or more loss functions 144 that are minimized to train the encoder 112 to generate latent space representations that accurately represent the input image 106 and can be processed efficiently by the other elements of the projection subsystem 110. In some aspects, the training subsystem further includes functionality to train the edit configurer 122 and/or the postprocessor 134. In some implementations, the training subsystem 140 is further configured to train the latent code transformer 114 to edit images.

The data storage unit 141 can be implemented as one or more databases or one or more data servers. The data storage unit 141 includes training data 142 that is used by the training subsystem 140 to train the engines of the image editing system 102. The training data 142 may include real images, synthetic images (e.g., as generated by the GAN), and/or latent space representations of the real and synthetic images.

Example Projection Pipeline

FIG. 2 depicts an example of a projection pipeline 200 according to certain embodiments of the present disclosure. The projection pipeline 200 includes an encoder 206 and a generator 210. In the projection pipeline 200, an input image 202 is encoded using the encoder 206 to produce a latent space representation w 208, which is then optimized using a combination of pixel loss 212, latent loss 216, and perceptual loss 218, resulting in an optimized latent space representation w_opt 228. In some implementations, some or all of the processing of FIG. 2 may be performed by an image editing system (e.g., the projection subsystem 110 in cooperation with other components of the image editing system 102 of FIG. 1).

In some implementations, the projection process includes:

1.  Use an encoder 206 to predict an initial latent code w₀.
2.  Initialize a variable w with the latent code w₀.
3.  For each iteration of the optimization, compute a loss between the target image 202 and the initial output image 211.

In some embodiments, the projection subsystem starts with an input image 202. This may be an image that a user seeks to edit, e.g., via an editor interface as shown in FIG. 1. The projection subsystem downsamples the input image 202 at 204. For example, the input image may be a relatively large image file such as a 1024×1024 pixel image. The projection subsystem may, for example, apply an algorithm such as bicubic interpolation to downsample the image. In the example depicted in FIG. 2, the projection subsystem downsamples the image to 256×256 pixels. In other examples, the projection subsystem may downsample the image to other resolutions (e.g., 128×128 pixels or 512×512 pixels).
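
As a minimal sketch of this step (assuming the Pillow library, which is merely one possible implementation choice), bicubic downsampling to 256×256 pixels can be performed as follows:

    from PIL import Image

    def downsample(path, size=(256, 256)):
        """Downsample an input image with bicubic interpolation before encoding."""
        return Image.open(path).convert("RGB").resize(size, Image.BICUBIC)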

In some embodiments, the projection subsystem feeds the downsampled image to the encoder 206. Using the encoder 206 (and potentially with additional mappings and transformations, as described above with respect to FIG. 1), the projection subsystem produces w 208, a latent space representation of the downsampled input image. The initial encoder output w 208 may diverge from the input image in certain respects. For example, without optimization, the image may not even look like the same person. The projection subsystem optimizes the latent space representation w using pixel loss 212, latent loss 216, and perceptual loss 218 to increase the similarity between the input image 202 and the ultimate output image. These losses may be minimized individually, or as part of a loss function with various terms as described below with respect to block 310 of FIG. 3A.

In some embodiments, the projection subsystem minimizes a pixel loss 212. First, an initial output image 211 is generated by passing the latent space representation w 208 as input to the generator 210. The projection subsystem computes the pixel loss 212 as a function of the initial output image 211 and the input image 202. The projection subsystem minimizes the pixel loss 212, and the latent space representation w 208 is adjusted accordingly.

In some embodiments, the projection subsystem minimizes a perceptual loss 218. First, an initial output image 211 is generated by passing the latent space representation w 208 as input to the generator 210. The projection subsystem downsamples the initial output image 211 at 220 and passes the downsampled image as input to selected layers of a convolutional neural network (e.g., the Visual Geometry Group (VGG) network 224) to extract perceptual features. Similarly, the input image 202 is downsampled at 222 and passed as input to the selected layers of the VGG network 224 to extract perceptual features. Layers near the input layer of the VGG network tend to pick up pixel-level features, deeper layers in the network pick up edges and blobs, and layers closer to the output layer pick up object-level features. Accordingly, layers closer to the input layer or output layer can be selected to extract different levels of perceptual features. The projection subsystem computes the perceptual loss 218 as a function of the features extracted from the input image 202 and the initial output image 211. The projection subsystem minimizes the perceptual loss 218, and the latent space representation w 208 is adjusted accordingly.
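
A sketch of this feature extraction is shown below, assuming PyTorch and torchvision's VGG16; the layer indices are illustrative choices corresponding to conv1_1, conv1_2, conv3_1, and conv4_1 in that implementation, and the exact VGG variant and layers used in a given embodiment may differ.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    class VGGFeatures(torch.nn.Module):
        """Extracts activations from selected layers of a pretrained VGG16."""

        def __init__(self, layer_indices=(0, 2, 10, 17)):  # conv1_1, conv1_2, conv3_1, conv4_1
            super().__init__()
            vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
            self.layers = vgg[: max(layer_indices) + 1]
            self.layer_indices = set(layer_indices)
            for p in self.parameters():
                p.requires_grad_(False)

        def forward(self, x):
            feats = []
            for i, layer in enumerate(self.layers):
                x = layer(x)
                if i in self.layer_indices:
                    feats.append(x)
            return feats

    def perceptual_loss(feature_extractor, generated_image, input_image):
        """Sum of feature differences over the selected layers."""
        return sum(F.l1_loss(g, t) for g, t in
                   zip(feature_extractor(generated_image), feature_extractor(input_image)))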

In some embodiments, the projection subsystem minimizes a latent loss 216. A target latent space representation w_target 214 is identified. The latent loss 216 is computed as a function of the latent space representation w 208 and the target latent space representation w_target 214. The projection subsystem minimizes the latent loss 216, and the latent space representation w 208 is adjusted accordingly. After adjusting the latent space representation w using the perceptual loss 218, the pixel loss 212, and/or the latent loss 216, an optimized latent space representation w_opt 228 is produced.

Accordingly, in some embodiments, the projection subsystem minimizes loss functions or components including the pixel loss 212, the perceptual loss 218, and the latent loss 216 to increase the accuracy of projection onto the GAN latent space. These projection techniques and their advantages are described in further detail below with respect to FIGS. 3A-3B.

Example Techniques for Identity Preserving Latent Space Projection

FIG. 3A is a flowchart of an example process 300 for projecting an image into the latent space of a GAN with improved efficiency and identity preservation according to certain embodiments of the present disclosure. The processing depicted in FIG. 3A may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, using hardware, or combinations thereof. The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method presented in FIG. 3A and described below is intended to be illustrative and non-limiting. Although FIG. 3A depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the processing may be performed in some different order or some steps may also be performed in parallel. In some implementations, one or more process blocks of FIG. 3A may be performed by an image editing system (e.g., the projection subsystem 110 in cooperation with other components of the image editing system 102 of FIG. 1). In some implementations, one or more process blocks of FIG. 3A may be performed by another device or a group of devices separate from or including the image editing system 102 (e.g., the editor interface 104 executing on a user device).

In some embodiments, at 302, the projection subsystem obtains an input image. For example, the projection subsystem receives an input image that is uploaded via the editor interface 104. The input image may be an image file that is to be edited (e.g., to change facial expression or age, as shown in FIGS. 4 and 5). Alternatively, or additionally, the projection subsystem may obtain the input image by retrieving the image from a local or remote database.

In some embodiments, at 304, the projection subsystem downsamples the input image. For example, the projection subsystem can apply an interpolation algorithm such as area interpolation or bicubic interpolation (see, e.g., Rajarapollu et al., Bicubic Interpolation Algorithm Implementation for Image Appearance Enhancement, IJCST Vol. 8, Issue 2 (2017)) to the input image obtained at 302 to generate a downsampled input image. In some implementations, the projection subsystem downsamples the input image to 256×256 pixel resolution. Use of a downsampled input image can significantly increase the speed of the optimization process, as the following steps are processed using a smaller input file size. As can be seen in the example outputs of FIGS. 4-5, an accurate and high resolution output can still be achieved. Alternatively, in some implementations, step 304 is omitted and the input image is provided as input to the encoder at 306 without downsampling.

In some embodiments, at 306, the projection subsystem produces an initial latent space representation of the input image by encoding the downsampled input image. For example, the projection subsystem produces the initial latent space representation by providing the downsampled input image as input to an encoder (e.g., to generate a Z space representation of the input image). This results in a latent space representation z of the input image. In some implementations, the latent space representation is further modified to map to W space and/or augmented to W_(p) space, as described above with respect to FIG. 1. Alternatively, a W_(p) space representation can be generated directly using the techniques described in “Direct Regression Encoder Architecture and Training,” filed concurrently herewith. By encoding the image before optimization, the projection subsystem initializes the optimization using an encoded image that is similar to the input image. Encoding the image before optimization further speeds up the optimization process, as the time to converge is faster when starting with a similar image rather than starting with a random image (e.g., as drawn from a Gaussian distribution).

In some embodiments, at 308, the image editing system generates, by a generator neural network, an initial output image by processing the latent space representation of the input image. For example, the projection subsystem transmits the latent space representation of the input image to the image generation subsystem 130. The image generation subsystem passes the latent space representation as input to a generator neural network to generate the initial output image. Techniques for image generation with a generative model are described in detail in, e.g., Goodfellow et al., Generative Adversarial Nets, NIPS 2014, arXiv:1406.2661v1 (2014) and Karras et al. (2019) (StyleGAN, supra).

The image editing system may initially generate a first initial output image by processing the initial latent space representation generated at 306. Subsequently, after updating the latent space representation at 312, the image editing system may generate one or more updated initial output images by processing the updated latent space representation(s) in the course of one or more subsequent iterations of the optimization process (e.g., a second initial output image, a third initial output image, etc.).

At 310, the projection subsystem computes a loss based on target perceptual features extracted from the input image and perceptual features extracted from the initial output image. Perceptual features are visually representable properties of objects, such as size, shape, color, position, facial expression, etc. These perceptual features are compared, for the input image and the initial output image (e.g., a first initial output image and/or updated initial output images generated at 308), to compute the loss. Techniques for extracting the perceptual features and computing a suitable loss function are described in further detail below with respect to FIG. 3B.

At 312, the projection subsystem updates the latent space representation according to the computed loss. The projection subsystem may use a suitable optimizer to compute an updated value of w.

In some implementations, the latent space representation is updated by computing

argmin_w Loss(w,x),

by applying an optimization algorithm (as further described below with respect to block 314) to the latent space representation w using the loss computed as described with respect to block 310 and FIG. 3B.

At 314, the projection subsystem determines whether the loss is minimized. In some implementations, the projection subsystem applies the Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) to minimize the loss function. L-BFGS uses a limited amount of computer memory. Use of L-BFGS for the optimization can speed up the optimization process and limit the amount of computational resources required. Compared to other optimizers tested, it has been found that the SciPy L-BFGS optimizer generates the best results in the least amount of time. Alternatively, or additionally, other optimizers may be implemented, such as traditional BFGS, another quasi-Newton method, or the Davidon-Fletcher-Powell (DFP) formula.
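
One possible realization of the update-and-check loop of blocks 308-314 is sketched below using the PyTorch L-BFGS optimizer; the disclosure reports that a SciPy L-BFGS optimizer gave the best results, so the optimizer choice and the `loss_fn` callable here are assumptions.

```python
import torch

def project(w: torch.Tensor, loss_fn, max_steps: int = 20) -> torch.Tensor:
    """Iteratively update the latent code w to minimize loss_fn(w)."""
    optimizer = torch.optim.LBFGS(
        [w], max_iter=max_steps, line_search_fn="strong_wolfe"
    )

    def closure():
        optimizer.zero_grad()
        loss = loss_fn(w)   # e.g. pixel + perceptual + latent loss (see FIG. 3B)
        loss.backward()
        return loss

    optimizer.step(closure)  # runs up to max_steps L-BFGS iterations internally
    return w.detach()
```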

If the loss is not minimized at 314, then the flow returns to 308. The updated latent space representation is used to generate an updated initial output image at 308. Thus, the projection subsystem updates the latent space representation iteratively based on the computed loss (e.g., to generate a first updated latent space representation, a second updated latent space representation, and so forth).

This can be repeated until eventually the latent space representation is sufficiently optimized (e.g., optimization has converged), at which point the loss is minimized at 314. If the loss is minimized at 314, then the process 300 proceeds to 316. When the loss is minimized, the updated latent space representation is considered optimized. Thus, one or more operations in blocks 308-314 generate an optimized latent space representation of the input image using a loss minimization technique that minimizes a loss between the input image and the initial output image, wherein the loss is based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image.

The optimized latent space representation is one that will produce an output image that looks very similar to the input image (e.g., indistinguishable or almost indistinguishable to the human eye). Without optimization, the generator can produce a high resolution and photorealistic image, but the image will not look perceptually similar to the input image. For example, for images including a human face, without optimization, the output image will generally not look like the same person as that depicted in the input image. Once the latent space representation is optimized, the ultimate output image will be perceptually similar to the input image. Perceptually similar images have similar perceptual features. For example, for images including human faces, perceptual features include hair color, nose shape, and facial expression. Images that are perceptually similar will generally look like the same person.

One or more operations in blocks 308-314 implement a step for optimizing the initial latent space representation based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image. For instance, at 308, the projection subsystem processes the initial latent space representation with a generator neural network to generate an initial output image, and at 310-314, the projection subsystem minimizes a loss between the input image and the initial output image to generate the optimized latent space representation, as described above and with respect to FIG. 3B.

In some embodiments, at 316, the projection subsystem outputs the optimized latent space representation of the input image for downstream use. The downstream use may include applying user-configured edits to the latent space representation. For example, the latent space representation may be modified in a way that corresponds to changes such as making a face in an image appear to smile or look older, adding high heels to a shoe in an image, and so forth. Alternatively, or additionally, the downstream use may include processing the optimized latent space representation with the generator neural network to generate an output image that is perceptually similar to the input image. This may be performed in a similar fashion as described above with respect to block 308, but using the optimized latent space representation as the input to the generator. The optimized latent space representation provided to the generator as input may be edited or unedited.

In some implementations, the generating the initial latent space representation, optimizing the initial latent space representation, and generating the output image that is perceptually similar to the input image is performed in less than about 10 seconds, in less than about 9 seconds, and/or in less than about 8 seconds. The techniques of FIG. 3A efficiently produce a projection without identity loss at high resolution (e.g., 1024×1024 pixels) in about 8 seconds on a Nvidia Tesla V100 GPU. Accordingly, the techniques described above with respect to FIGS. 3A-3B significantly reduce the time required to generate a high-resolution and accurate image, which takes several minutes in many prior systems.

In some embodiments, the process 300 further includes outputting the output image to a computing device for display. The computing device may correspond to the editor interface 104 depicted in FIG. 1 (e.g., executing on a user device or the image editing system itself). For example, the image editing system outputs the output image to a user device, thereby causing the user device to display the output image via the editor interface displayed on the user device. Alternatively, or additionally, the image editing system transmits instructions for rendering the output image to an external computing device. Alternatively, or additionally, the image editing system renders the output image on a display component of the image editing system itself.

In some embodiments, prior to the processing of FIG. 3A, the encoder is trained on synthetic images. For example, the encoder is trained on images generated by a generator such as a StyleGAN generator (as described in Karras et al. (2019), supra). In some implementations, the generator-created images are generated from a Gaussian distribution. In some implementations, the Gaussian distribution is truncated (e.g., using a truncation value of 0.7). Training with synthetic images has been found to provide regularization, leading the encoder to predict latent codes corresponding to images that the generator implemented (e.g., StyleGAN) can generate accurately.

FIG. 3B is a flowchart of an example process 350 for computing a loss (e.g., at block 310 of FIG. 3A) according to certain embodiments of the present disclosure. The processing depicted in FIG. 3B may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, using hardware, or combinations thereof. The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method presented in FIG. 3B and described below is intended to be illustrative and non-limiting. Although FIG. 3B depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the processing may be performed in some different order or some steps may also be performed in parallel. In some implementations, one or more process blocks of FIG. 3B may be performed by an image editing system (e.g., the projection subsystem 110 in cooperation with other components of the image editing system 102 of FIG. 1). In some implementations, one or more process blocks of FIG. 3B may be performed by another device or a group of devices separate from or including the image editing system 102 (e.g., the editor interface 104 executing on a user device).

At 352, the projection subsystem extracts perceptual features from the input image and the initial output image using a convolutional neural network. As described above with respect to FIG. 3A, perceptual features such as size, shape, color, and the like can be extracted from an image.

In some embodiments, to extract perceptual features, the image editing system extracts the perceptual features using a convolutional neural network trained to classify images. The output of different layers of such a classifier network can be correlated to different perceptual features in an image. Both the initial output image(s) generated at block 308 of FIG. 3A and the original input image (e.g., the target image which the optimization process aims to replicate) are passed as input to the convolutional neural network (e.g., at an input layer) and the perceptual features are extracted from selected layers of the convolutional neural network.

In some implementations, the convolutional neural network is a Visual Geometry Group (VGG) network, e.g., as described in Simonyan et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015, arXiv:1409.1556v6 (2015). The VGG network architecture includes a stack of convolutional (conv.) layers, three fully-connected layers, and a softmax layer. In some aspects, the projection subsystem selects the layers so that both high-level and low-level features are extracted. Minimizing loss between features of different levels has been found to steer the latent space representation to preserve identity. Suitable layers from which to extract the features as output include the conv1_1 layer, the conv1_2 layer, the conv3_1 layer, and the conv4_1 layer of the Visual Geometry Group Very Deep 16 (VGG-VD-16) network.
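
A minimal sketch of this feature extraction is shown below, assuming the torchvision VGG-16 implementation, in which indices 0, 2, 10, and 17 of the `features` module correspond to conv1_1, conv1_2, conv3_1, and conv4_1; the input normalization expected by the pretrained weights is omitted for brevity.

```python
import torch
import torchvision.models as models

# Pretrained VGG-16 used only as a fixed feature extractor.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
FEATURE_LAYERS = (0, 2, 10, 17)  # conv1_1, conv1_2, conv3_1, conv4_1

def extract_perceptual_features(image: torch.Tensor):
    """Run the image through VGG and collect activations at selected layers."""
    feats, h = [], image
    for idx, layer in enumerate(vgg):
        h = layer(h)
        if idx in FEATURE_LAYERS:
            feats.append(h)
        if idx >= max(FEATURE_LAYERS):
            break  # deeper layers are not needed
    return feats
```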

In some embodiments, the projection subsystem converts weights from the TensorFlow VGG format to PyTorch format before executing a PyTorch-based convolutional neural network (e.g., PyTorch VGG). This has been found to produce improved projections over use of TensorFlow or PyTorch weights alone. The input range for the PyTorch VGG is between zero and one, whereas the input range for the TensorFlow VGG is negative one to one. This widened range helps the optimization to converge. Accordingly, in some implementations, weights are computed in a first format with a first input range and converted to a second format with a second input range. The first range is larger than the second range.

The perceptual features extracted from the initial output image represent initial perceptual features. These perceptual features may differ from the actual perceptual features in the input image that the projection subsystem aims to replicate (e.g., the target perceptual features). By minimizing a loss between the initial perceptual features and the target perceptual features, the projection subsystem increases the perceptual similarity between the input image and the final output image that can ultimately be generated using the latent code.

In some implementations, the projection subsystem further downsamples the images before extracting the perceptual features. For example, the projection subsystem computes the perceptual loss component by downsampling the initial output image and passing the downsampled initial output image as input to the convolutional neural network. The projection subsystem extracts the initial perceptual features as output from a subset of layers of the convolutional neural network. The projection subsystem also passes the downsampled input image (e.g., as computed at 304) as input to the convolutional neural network to extract target perceptual features from the subset of the layers of the convolutional neural network. The projection subsystem computes the perceptual loss as a function of the target perceptual features and the initial perceptual features.

At 354, the projection subsystem computes a perceptual loss based on the perceptual features extracted at 352. For example, the perceptual loss is the norm of the difference between the perceptual features extracted from the input image and the perceptual features extracted from the initial output image generated by processing the latent space representation with the generator:

PerceptualLoss(G(w),x)=∥P(G(w))−P(x)∥,

where P(G(w)) is the perceptual features extracted from the output image generated by processing the latent space representation with the generator and P(x) is the perceptual features extracted from the input image. As illustrated in FIG. 3A, computing the loss and generating the output image may be performed iteratively until convergence. Accordingly, as the latent space representation is updated in the optimization process, the output image may be regenerated and the perceptual loss may be recomputed.
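
Using the feature extractor sketched above, the perceptual loss can be computed roughly as follows; summing a per-layer norm is one way to realize ∥P(G(w))−P(x)∥ when P returns features from several layers, and is an assumption rather than the only formulation.

```python
import torch

def perceptual_loss(generated, target, extract=None):
    """Norm of the difference between VGG features of G(w) and of the input x,
    accumulated over the selected layers."""
    extract = extract or extract_perceptual_features  # from the sketch above
    loss = 0.0
    for f_gen, f_tgt in zip(extract(generated), extract(target)):
        loss = loss + torch.norm(f_gen - f_tgt)
    return loss
```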

In some implementations, the loss may further include a pixel loss component and/or a latent loss component. Steps 356 and 358 may optionally be performed to compute the pixel loss and latent loss components.

At 356, the projection subsystem computes a pixel loss component

PixelLoss(G(w),x)

based on pixels of the input image x and pixels of the initial output image G(w). For example, the pixel loss may be a sum of squared differences of pixel values between some or all pixels in the input image and corresponding pixels in the initial output image. An example of a suitable pixel loss function is:

$\frac{1}{n}\sum_{i=1}^{n}\left| G(w)_{i} - x_{i} \right|^{2}$

where the pixels of the initial output image G(w) are given by G(w)_(i) and the pixels of the input image x are given by x_(i), and the square of the absolute value of the difference of each respective pixel is summed and averaged over the number of pixels of interest n (e.g., n total pixels in the images). In some implementations, the downsampled input image generated at 304 is used for x for consistency in image size and resolution for comparison.
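
A corresponding sketch of the pixel loss component, assuming image tensors of matching size (e.g., both downsampled to 256×256):

```python
import torch

def pixel_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean of squared per-pixel differences between G(w) and x."""
    return torch.mean((generated - target) ** 2)
```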

At 358, the projection subsystem computes a latent loss component based on the initial latent space representation and a target latent code. For example, the latent loss is given by the norm of the difference between the initial latent space representation and a target latent code,

∥w−w_target∥,

where w is the latent space representation encoded at 306. The target latent code w_target can be a selected latent code such as the mean latent code from the training of the generator neural network. In some implementations, a user can provide a user-specified guiding latent code w_target, which allows for increased control in steering the projection.

The loss function used at 310 may include one or more of the perceptual loss computed at 354, the pixel loss computed at 356, and/or the latent loss computed at 358. For example, the loss function is equal to:

Loss(w,x,w_target)=PixelLoss(G(w),x)+PerceptualLoss(G(w),x)+∥w−w_target∥.

This loss function, including a pixel loss component, a perceptual loss component, and a latent loss component, has been found to converge in a relatively fast timeframe (<10 s) while preserving identity and resolution.
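
Putting the components together, a sketch of the combined loss of FIG. 3B might look as follows; the `generator` callable and the reuse of the `pixel_loss` and `perceptual_loss` helpers sketched above are assumptions.

```python
import torch

def latent_loss(w: torch.Tensor, w_target: torch.Tensor) -> torch.Tensor:
    """Norm of the difference between the current latent code and a target code."""
    return torch.norm(w - w_target)

def total_loss(w, x, w_target, generator):
    """Pixel + perceptual + latent loss between the input image x and G(w)."""
    g_w = generator(w)  # current output image
    return pixel_loss(g_w, x) + perceptual_loss(g_w, x) + latent_loss(w, w_target)
```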

The projection techniques of FIGS. 2 and 3A-3B provide many advantages over prior systems. These techniques offer a reasonable compromise between efficiency and accuracy. Projection time using the techniques disclosed herein is less than about 10 seconds (e.g., 8 seconds), compared to prior systems that take several minutes. Further, the techniques of FIGS. 2 and 3A-3B are identity preserving, high resolution (e.g., 1024×1024 pixels), and editable. In some aspects, the techniques of FIGS. 2 and 3A-3B can be applied without the need to modify the architecture of the GAN. The techniques of FIGS. 2 and 3A-3B described above provide identity preserving projection. The output images generated using the projection techniques described above are able to maintain identity with (e.g., for face images, the identity of the person is maintained), and perceptual similarity to, the input image. This is in contrast to some prior systems which fail to maintain identity, such that the generated face images look noticeably like a different person than depicted in the input image.

Example Results—Identity Preserving Latent Space Projection

FIG. 4 shows a series of images 400 illustrating edited images generated using the projection techniques of FIGS. 3A and 3B, according to some embodiments. The process starts with an input image 402. The projection techniques of FIGS. 3A and 3B are used to discover an optimized latent space representation w which can be edited to make global changes to the image.

Images 404, 406, and 408 have been edited using an optimized latent space representation as generated using the techniques of FIGS. 3A and 3B. In image 404, the optimized latent space representation has been edited so that the face in the output image appears younger. In image 406, the optimized latent space representation has been edited so that the face in the output image is smiling. In image 408, the optimized latent space representation has been edited so that the face in the output image is rotated.

Using the projection technique described above with respect to FIGS. 3A-3B, the images 404, 406, and 408 remain consistent with the input image 402—the edited images 404, 406, and 408 still look like the same person as that in the input image 402.

FIG. 5 depicts a set of images 500 illustrating the use of latent loss to improve details in an output image. In these examples, the loss minimized in the process of FIG. 3A includes the latent loss component computed at block 358 of FIG. 3B. The latent loss component can be particularly useful in creating realistic features. In particular, teeth and eyes often become unrealistic in GAN generated images after editing. Using the techniques of FIGS. 3A-3B, these issues can be resolved.

Image 502 shows an input image which includes a picture of a human face. Images 504-510 show images generated based on the input image 502. The images 504-510 have been generated using the techniques described above with respect to FIGS. 3A-3B and the latent space representation of the input image has been edited so that the face depicted in the output image has a smiling expression.

The process for generating images 504 and 508 includes optimizing a latent space representation of the input image 502 (starting either with a randomly sampled latent code or an initial latent space representation generated with an encoder) using a computed loss to generate an optimized latent space representation of the input image. The computed loss, however, does not include a latent loss component. The optimized latent space representation is edited so that the person depicted in the image appears to smile. This edited latent space representation is processed using a generator to generate output image 504 (shown zoomed in as image 508 to highlight the teeth). In images 504 and 508, the teeth appear stained and brown. This is a common issue in generated images using prior techniques.

On the other hand, the process for generating images 506 and 510 includes optimizing a latent space representation of the input image 502 using a computed loss to generate an optimized latent space representation of the input image, as described above with respect to FIGS. 3A-3B. The computed loss used to generate image 506 includes a latent loss component, as described above with respect to FIG. 3B. The optimized latent space representation is edited so that the person depicted in the image appears to smile. This edited latent space representation is processed using a generator to generate output image 506 (shown zoomed in as image 510 to highlight the teeth). Using these techniques, as shown in images 506 and 510, the appearance of the teeth is significantly improved while reasonably maintaining identity with the input image 502.

Example Techniques for Multi-Resolution Output

FIG. 6 is a flowchart of an example process 600 for generating multi-resolution outputs from a GAN according to certain embodiments of the present disclosure. The processing depicted in FIG. 6 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, using hardware, or combinations thereof. The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method presented in FIG. 6 and described below is intended to be illustrative and non-limiting. Although FIG. 6 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the processing may be performed in some different order or some steps may also be performed in parallel. In some implementations, one or more process blocks of FIG. 6 may be performed by an image editing system (e.g., the projection subsystem 110 in cooperation with other components of the image editing system 102 of FIG. 1). In some implementations, one or more process blocks of FIG. 6 may be performed by another device or a group of devices separate from or including the image editing system 102 (e.g., the editor interface 104 executing on a user device).

In some embodiments, at 602, the projection subsystem obtains an input image. For example, the projection subsystem receives an input image that is uploaded via the editor interface, as described above with respect to block 302 of FIG. 3A.

In some embodiments, at 604, the projection subsystem produces a latent space representation of the input image. For example, the projection subsystem produces the latent space representation of the input image by providing the input image as input to an encoder, in a similar fashion as described above with respect to block 306 of FIG. 3A. This may be an initial latent space representation before optimization, or an updated latent space representation during or after optimization.

In some embodiments, at 606, the image editing system generates a first output image at a first resolution by providing the latent space representation of the input image as input to a generator neural network. The generator neural network is configured to take a latent space representation as input and generate an image as output (e.g., as described above with respect to block 308 of FIG. 3A). The generator neural network includes an input layer, multiple intermediate layers, and an output layer. An intermediate layer is a layer other than the input or output layers of the neural network (e.g., a hidden layer). The latent space representation is provided as input to the input layer. The first output image is taken from one of the intermediate layers as output (e.g., from a first intermediate layer of the generator neural network).

In some embodiments, the generator neural network is coupled to one or more auxiliary neural networks. The auxiliary neural network(s) are configured to output images from an intermediate layer of the generator neural network. For example, as illustrated in FIGS. 7 and 8, auxiliary neural networks are attached to intermediate layers of the generator neural network. The auxiliary neural network is a branch which maps the features from an arbitrary layer in a generator neural network to a low-resolution image resembling the high-resolution image output of the generator neural network. Alternatively, the auxiliary branch can be another type of machine learning model configured to output an image from the intermediate layer. The first output image is output via the auxiliary branch. These branches can be analogized to the levels of an image pyramid. These auxiliary neural networks may, for example, each function as an image-to-image network. In some aspects, the features from an intermediate layer of the generator neural network are input to the auxiliary neural network and processed using residual block layers to output a relatively low-resolution image.

As a specific example, the first output image is output via a second neural network, which is one of the one or more auxiliary neural networks. Features are extracted from the intermediate layer of the generator neural network and processed by the second neural network to generate the first output image. One or more operations in block 606 implement a step for generating a first output image at a first resolution using an intermediate layer of the generator neural network.
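
A minimal sketch of such an auxiliary branch is shown below; the disclosure describes residual block layers, which are simplified here to plain convolutional blocks, and the channel counts are assumptions.

```python
import torch
import torch.nn as nn

class AuxiliaryBranch(nn.Module):
    """Maps intermediate generator feature maps to a small RGB preview image."""

    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, 3, kernel_size=1),  # project features to RGB
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.net(features))  # preview image in [-1, 1]
```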

In some embodiments, the image editing system performs optimization operations, as described above with respect to FIGS. 3A-3B. The optimization process is performed using the output of the intermediate layer of the generator neural network, using the auxiliary neural network to extract images. The image editing system uses initial first output images output via the auxiliary neural network to minimize a loss function with respect to the input image until convergence. This provides relatively fast optimization since only a subset of layers of the generator neural network are used.

In some embodiments, at 608, the image editing system generates a second output image at a second resolution. The image editing system provides the latent space representation of the input image as input to the input layer of the generator neural network to generate the second output image. The second output image is output from the output layer of the generator neural network. The second resolution of the second output image is different from (e.g., higher than) the first resolution of the first output image. In some implementations, the second output image is a relatively high resolution or large size (e.g., about 1024×1024 pixels) final output image, and the first output image is a lower resolution version of the second output image. In some aspects, the lower resolution image generated at 606 roughly matches the high-resolution GAN output generated at 608 after down-sampling. One or more operations in block 608 implement a step for generating a second output image at a second resolution different from the first resolution using an output layer of the generator neural network.

In some embodiments, the projection subsystem performs optimization operations, as described above with respect to FIGS. 3A-3B. The projection subsystem may use an initial second output image to minimize a loss function with respect to the input image, using images generated as output of the output layer of the neural network, until arriving at an optimized second output image. Since the first output image has a shorter path through the generator (e.g., as illustrated in FIG. 8), optimization at 606 is significantly faster than optimization at 608 (e.g., at 606 the projection subsystem is performing forward and backward passes through a shorter path of the GAN, i.e., using layers closer to the input layer). These techniques can also be used for coarse-to-fine regularization.

In some embodiments, additional output images are extracted from additional intermediate layers of the generator neural network. The projection subsystem may include multiple auxiliary neural networks for extracting preview images, as illustrated in FIG. 7. For example, the projection subsystem generates a third output image using another intermediate layer of the generator neural network. The image editing system provides the latent space representation of the input image as input to the generator neural network. The image editing system takes the third output image from a second intermediate layer of the generator neural network. The third output image is of a different resolution than the first output image and the second output image (e.g., a fifth resolution).

In some implementations, the generator neural network includes a first auxiliary neural network (e.g., a second neural network) configured to output images from the first intermediate layer of the generator neural network and a second auxiliary neural network (e.g., a third neural network) configured to output images from the second intermediate layer of the generator neural network. The first output image is output via the first auxiliary neural network, and the third output image is output via the second auxiliary neural network. For example, as shown in FIG. 7, the generator neural network can be augmented with multiple auxiliary neural networks 714, 716, and 718. Each of these can be used to generate output images (e.g., quick preview images). The generator neural network also generates a final output image 726 by processing the input latent space representation via the output layer.

In some embodiments, subsequent to generating the first output image and the second output image, the image editing system outputs the first output image and the second output image for display on a computing device (e.g., the editor interface 104 illustrated in FIG. 1). For example, the image editing system transmits instructions to a user device for rendering the editor interface to include the first output image and the second output image. The first and second output images may be displayed one after another. For example, the first output image is displayed during an optimization process of the second output image, and after optimization of the second output image is complete, the second output image is displayed. Alternatively, or additionally, the image editing system may display the first output image and the second output image simultaneously.

In some embodiments, the first output image is generated in less than about five seconds after obtaining the input image. Since the first output image is produced using a subset of the generator neural network, the first output image (e.g., a preview image) can be generated more quickly than the second output image (e.g., a final output of the generator). With the projection enhancing techniques described above with respect to FIGS. 3A-3B, the complete projection and generation process can be accomplished in around 8 seconds. By generating the preview images using a subset of the neural network (e.g., at 606), the preview image(s) can be generated even faster, in about 4 seconds.

The lower resolution image(s) generated at 606 can be used, for example, to quickly obtain a preview image as a large image is being processed. In some applications, the image editing system provides output such that the user can watch the image develop over time. For example, a low resolution image is displayed, then a medium resolution image, then a higher resolution image, then a highest resolution image (e.g., the first output image, third output image, and second output image are displayed in turn).

The image editing system may train the auxiliary neural network(s) at some initial time. For example, the training subsystem trains an auxiliary neural network on an input training image. The input training image has some resolution (e.g., a third resolution). The training subsystem generates a training image with a resolution lower than that of the input training image (e.g., a fourth resolution less than the third resolution). For example, the lower resolution training image can be generated using downsampling techniques as described above with respect to block 304 of FIG. 3A. The training subsystem extracts features from the first intermediate layer of the generator neural network. For example, the input image is sent as input to the input layer of the generator neural network and data is extracted from the layer of the generator at which the auxiliary neural network will be applied (e.g., the first intermediate layer, the second intermediate layer, etc.). This data may then be processed using the auxiliary neural network to generate a training output image. The training subsystem minimizes a loss between the reduced-resolution training image and the training output image generated from the extracted features. The auxiliary neural network is trained (e.g., using backpropagation) to output lower resolution images which match the high-resolution generator neural network output as closely as possible. In some aspects, once the auxiliary neural network is trained, the auxiliary neural network is attached to the generator neural network. This results in an auxiliary neural network configured to generate an image from an intermediate layer of the generator neural network relatively quickly.
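
The following sketch illustrates one way to realize this training loop; it assumes a hypothetical `generator(w, return_features=True)` interface that returns both the final image and the tapped intermediate feature map, and it trains against downsampled generator outputs produced from sampled latents, which is an assumption rather than the only possible setup.

```python
import torch
import torch.nn.functional as F

def train_auxiliary_branch(aux, generator, sample_latents, steps=1000, lr=1e-3):
    """Train an auxiliary branch so its preview matches the downsampled
    full-resolution generator output."""
    opt = torch.optim.Adam(aux.parameters(), lr=lr)
    for _ in range(steps):
        w = sample_latents()  # latent codes used as training inputs
        with torch.no_grad():
            full_image, features = generator(w, return_features=True)
        preview = aux(features)
        # Downsample the generator output to the preview resolution (cf. block 304).
        target = F.interpolate(full_image, size=preview.shape[-2:], mode="area")
        loss = F.mse_loss(preview, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return aux
```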

The techniques of FIG. 6 can also be applied to generative models other than GANs. While the example of FIG. 6 is described with respect to image data, these techniques can also be applied to generate previews using other types of data, such as audio or video data.

Example Results—Multi-Resolution Output

FIG. 7 depicts a schematic diagram 700 illustrating the multi-resolution output process of FIG. 6 according to certain embodiments of the present disclosure. The pipeline depicted in FIG. 7 includes a latent space representation of an input image z 702 (e.g., an initial latent space representation before optimization; updated latent space representations may also be provided during or after optimization). The latent space representation of the input image z 702 is processed by a pretrained GAN 704 to generate an output image 726.

The GAN further includes auxiliary neural networks 714, 716, and 718. These auxiliary neural networks 714-718 are attached to intermediate layers of the GAN. These auxiliary neural networks 714-718 are trained to generate low-resolution preview images of the ultimate GAN output image 726.

The GAN 704 includes layers 706, 708, 710, and 712. Each layer is larger in size than the previous layer. Each respective layer is capable of generating images of increased resolution. For example, the GAN may start at layer 706 by generating an image at a first resolution Res1 (e.g., an 8×8 or 4×4 pixel image) and generate images increasing in resolution with successive layers (e.g., 64×64 pixels at layer 708, 1024×1024 pixels at layer 710, and 2400×2400 pixels at layer 712).

The first auxiliary neural network 714 generates a lowest resolution (Res1) preview image 720 from layer 706, closest to the input layer of the GAN 704. The second auxiliary neural network 716 generates a higher resolution (Res2) preview image 722 from layer 708, further from the input layer of the GAN 704. The third auxiliary neural network 718 generates a highest resolution (Res3) preview image 724 from layer 710, closer to the output layer of the GAN 704. In this fashion, intermediate images of different resolutions are output. The final output image 726 has a higher resolution than the preview images (Res4). Thus, Res1<Res2<Res3<Res4.

FIG. 8 depicts another schematic diagram 800 illustrating the multi-resolution output process of FIG. 6 according to certain embodiments of the present disclosure. Similarly to FIG. 7, the pipeline depicted in FIG. 8 includes a latent space representation of an input image z 802, which is processed by a pretrained GAN 804 to generate an output image 818. The GAN 804 includes layers 806-812 of increasing size and distance from the input layer of the GAN 804.

The GAN includes auxiliary neural network 814 attached to an intermediate layer 808 of the GAN 804. The auxiliary neural network 814 is trained to generate a relatively low-resolution preview image 816 of the ultimate GAN output image 818.

FIG. 8 shows the path 822 of the preview image 816 as compared to the path 820 of the full-resolution output image 818. The path 822 for the preview image 816 is a relatively short path. Accordingly, during optimization, instead of traversing the entire GAN at each iteration (as with path 820), the shorter path 822 used for the preview image 816 allows for faster inference.

FIG. 9 depicts examples of images 900 generated and edited using the techniques of FIG. 6, according to certain embodiments of the present disclosure. Rows 902, 904, 906, 908, and 910 each show a series of images that are generated and edited to smile based on a different respective input image. Columns 912, 916, 920, 924, and 928 are preview images generated using a GAN with auxiliary neural networks for the respective preview images, as described above with respect to FIG. 6. From left to right, each column represents a larger layer of the generator neural network which is further from the input layer and generates a higher-resolution image. For comparison, columns 914, 918, 922, 926, and 930 show images generated by taking an output image from the output layer of the GAN and downsampling the output image. As shown in FIG. 9, the preview images, which are generated relatively quickly on the order of 1-4 seconds or less, are a reasonable approximation of the ultimate output image and comparable to the downsampled versions.

These preview images are useful in the context of an editing interface. For example, the editing interface shows a preview thumbnail image of the final edited image as the final edited image is being processed. The low resolution preview images can be shown very quickly without having to wait for the final image. The displayed thumbnail image can be updated as images of higher resolution become available.

FIG. 10 depicts additional examples of generated images 1002 and 1004 generated using the techniques of FIG. 6, according to certain embodiments of the present disclosure. Image 1004 (right) shows a generated image which has been output via the output layer of the generator. Image 1004 was generated after optimizing the latent space representation (as described above with respect to FIGS. 3A-3B) for about 8 seconds. Image 1004 has a resolution of 1024×1024 pixels. Image 1002 (left) shows a generated image which has been output from an intermediate layer of the generator, based on the same input image as image 1004. Image 1002 was generated after optimizing the latent space representation for about 3 seconds. Image 1002 was optimized faster than image 1004, as each pass through the generator involves less processing when extracting the image 1002 from the intermediate layer. Image 1002 has a resolution of 256×256 pixels. As shown in FIG. 10, the image 1002 looks very similar to the final image 1004, but with less detail.

Example Techniques for Domain to Domain Projection

FIG. 11 depicts an example of a process 1100 for generating an image in a different domain (e.g., style) than the input image using a discriminator loss, according to certain embodiments of the present disclosure. In some examples, the projection subsystem projects an image in a first domain, such as a collage, sketch, or cartoon, to an image in a second domain, such as a photorealistic image. In other examples, an image can be projected from a photorealistic image to a cartoon, from a sketch to a painting, and so forth. In some aspects, constraints are applied which encourage the latent variable to stay near a particular domain, such as the natural image manifold. The processing depicted in FIG. 11 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, using hardware, or combinations thereof. The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method presented in FIG. 11 and described below is intended to be illustrative and non-limiting. Although FIG. 11 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the processing may be performed in some different order or some steps may also be performed in parallel. In some implementations, one or more process blocks of FIG. 11 may be performed by a computing system (e.g., the projection subsystem 110 in cooperation with other components of the image editing system 102 of FIG. 1). In some implementations, one or more process blocks of FIG. 11 may be performed by another device or a group of devices separate from or including the image editing system 102 (e.g., the editor interface 104 executing on a user device). In some embodiments, the process 1100 is performed using a pipeline that includes a GAN comprising a generator neural network and a discriminator neural network, as shown in FIG. 1.

In some embodiments, at 1102, the projection subsystem obtains an input image in a first domain and information about a target domain. For example, the projection subsystem obtains the input image via the editor interface. A user may upload an input image to be edited, as described above with respect to block 302 of FIG. 3A. The first domain is a particular image style, examples of which include a sketch, a painting, a cartoon, a three-dimensional (3D) model, a statue, a photo collage, a low-resolution image, and a photorealistic image such as a digital photograph.

The image editing system further receives information about a target domain. The target domain corresponds to an image style different from the first domain, e.g., photorealistic image, sketch, cartoon, etc. The information about the target domain may include a target latent code corresponding to the target domain. The target latent code w_target can be a selected latent code such as the mean latent code from the training of the GAN. In some implementations, a user can provide a user-specified guiding latent code w_target, which allows for increased control in steering the latent code towards a desired style. In some embodiments, the projection subsystem uses the target latent code to identify a target output of the discriminator neural network. For example, the projection subsystem computes the target discriminator output as a function of the generator output using a target latent code:

D(G(w_target)).

The target latent code is provided as input to the generator neural network to generate a target image. The generator neural network is configured to take a latent space representation as input and generate an image as output (e.g., as described above with respect to block 308 of FIG. 3A). The target image is then processed by the discriminator neural network to compute the target discriminator output. As described above with respect to FIG. 1, a discriminator may generate a score indicating whether the discriminator has determined that an image generated by the generator is likely to be a real photograph or a computer-generated copy. This can be binary (e.g., 1=photograph; 0=computer-generated copy), or a score indicating confidence that the image is a real photograph (e.g., 100=definitely a real photograph; 0=definitely a computer-generated copy, with values in between corresponding to confidence level). In other examples, the GAN is trained on images from a domain such as impressionist paintings. In this case, the discriminator has been trained to identify whether an image is in the style of impressionist paintings, and this is indicated by the score output by the discriminator.

Alternatively, the received information about the target domain may be the target discriminator output itself. In this case, the target discriminator output may, for example, be computed by an external system or configured by an administrator.

At 1104, the projection subsystem produces an initial latent space representation of the input image by encoding the input image. For example, the projection subsystem produces the initial latent space representation by passing the input image as input to an encoder neural network configured to output a latent space representation of an input image, as described above with respect to block 306 of FIG. 3A.

At 1106, the image editing system generates an initial output image by processing the latent space representation with the generator neural network. The generator neural network is configured to take a latent space representation as input and generate an image as output (e.g., as described above with respect to block 308 of FIG. 3A). The image editing system outputs the initial output image.

At 1108, based on the initial output image and the information about the target domain, the image editing system computes a score indicating whether the initial output image is in the target domain. The score may correspond to the output of the discriminator neural network after receiving the initial output image generated at 1106 as input:

D(G(w)).

As described above with respect to block 1102, the output of the discriminator, when given an input image, represents a confidence that the image is in the domain that the discriminator has been trained on. Thus, if the target domain is that of photorealistic images, a discriminator trained on photorealistic images will output a score indicating whether the image generated at 1106 is a photorealistic image. If the target domain is that of cartoons, a discriminator trained on cartoons will output a score indicating whether the image generated at 1106 is a cartoon, and so forth.

At 1110, the image editing system computes a loss as a function of the score computed at 1108. This may be a component of an overall loss function based on discriminator output. An example of such a discriminator loss component is:

∥D(G(w))−D(G(w_target))∥,

the norm of the difference between the score computed at 1108 and the target discriminator output (e.g., the target domain information obtained at 1102 or a derivative thereof). The discriminator loss can be used to constrain the latent space representation towards the domain in which the GAN has been trained. For example, using a GAN such as StyleGAN, which has been trained to generate photorealistic images of faces, minimizing the discriminator loss will pull the latent space representation towards the domain of photorealistic images of faces. Applying the discriminator loss for a GAN that has been trained on a particular domain of images will enforce that domain. For example, the discriminator loss can be used to constrain the latent space representation towards domains such as anime cartoons of faces, paintings of shoes, and so forth, based on the type of images used to train the discriminator.
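
A sketch of this discriminator loss component, assuming `generator` and `discriminator` callables for the pretrained GAN:

```python
import torch

def discriminator_loss(w, w_target, generator, discriminator):
    """Norm of the difference between the discriminator score for G(w) and the
    target score D(G(w_target)); pulls the projection towards the target domain."""
    with torch.no_grad():
        target_score = discriminator(generator(w_target))
    return torch.norm(discriminator(generator(w)) - target_score)
```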

In some implementations, the loss function includes additional components, which may be similar to those described above with respect to FIG. 3B. In some embodiments, the loss function includes a latent loss component. For example, the latent loss component is based on a difference between the initial latent space representation and a target latent code. As a specific example, the latent loss component is

∥w_target−w∥,

the norm of the difference between the target latent code and the initial latent space representation. The target latent code may, for example, include a mean latent code from a training phase of the generator neural network or a user-selected target latent code, as described above with respect to block 358 of FIG. 3B and block 1102.

In alternative or additional implementations, the loss further includes a pixel loss component and/or a perceptual loss component. As described above with respect to FIG. 3B, a pixel loss component such as

PixelLoss(G(w),x)

can be computed by comparing the output of the generator, given the initial latent space representation as input, to the original input image. Examples of pixel loss are further described above with respect to block 356 of FIG. 3B.

A perceptual loss component

PerceptualLoss(G(w),x)

may be computed by comparing perceptual features extracted from the input image and perceptual features extracted from an image generated from the initial latent space representation, as described above with respect to blocks 352-354 of FIG. 3B.

Accordingly, in some implementations, the loss includes a discriminator output component, a latent loss component, a perceptual loss component, and a pixel loss component. An example of such a loss function is:

Loss(w,x,w_target)=PixelLoss(G(w),x)+PerceptualLoss(G(w),x)+∥w_target−w∥+∥D(G(w_target))−D(G(w))∥

In some implementations, the projection subsystem further includes an identity loss term to guide the projection towards a particular image. This allows for projecting to a GAN manifold, but guides the projection based on a user-specified image. For example, if a user wants to project an image of a sketch to a GAN manifold of realistic faces but wants the result to look more like a certain person, the user can also provide as input an image of that person. To guide the projection towards a domain such as photorealism while preserving identity, the projection subsystem can further include an additional loss component comparing the output of a face recognition model for the target image x (or any other image) and for the GAN output G(w). An example of a suitable face recognition model is ArcFace, as described in Deng et al., ArcFace: Additive Angular Margin Loss for Deep Face Recognition, arXiv:1801.07698 (2019). The identity loss,

IdentityLoss(G(w),x)

can be part of an overall loss function such as:

F(w,x,w_target)=PixelLoss(G(w),x)+PerceptualLoss(G(w),x)+∥D(G(w_target))−D(G(w))∥+IdentityLoss(G(w),x).
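
A sketch of the identity loss term, where `face_embedder` stands in for a pretrained face recognition model such as ArcFace and is an assumption rather than a component provided by this disclosure:

```python
import torch

def identity_loss(generated, target, face_embedder):
    """Difference between face-recognition embeddings of G(w) and the guide image x."""
    return torch.norm(face_embedder(generated) - face_embedder(target))
```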

At 1112-1114, upon selecting and computing a suitable loss function, the projection subsystem minimizes the loss to compute an updated latent space representation of the input image. The projection subsystem may use a suitable optimizer to find a value of w to minimize the loss. For example, the projection subsystem computes:

argmin_w PixelLoss(G(w),x)+PerceptualLoss(G(w),x)+∥w_target−w∥+∥D(G(w_target))−D(G(w))∥.

In some implementations, the projection subsystem applies the Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) to minimize the loss function and identify the optimized w value, as described above with respect to blocks 312-314 of FIG. 3A.

In some embodiments, the projection subsystem updates the latent space representation iteratively based on the computed loss (e.g., to generate a first updated latent space representation, a second updated latent space representation, and so forth). This can be repeated until eventually the latent space representation is sufficiently optimized (e.g., “yes” at 1114, indicating optimization has converged), at which point the process 1100 proceeds to 1116.

One or more operations in blocks 1106-1114 implement a step for updating the initial latent space representation by minimizing a loss based on a score generated using the discriminator neural network. For instance, at block 1106, the projection subsystem generates an initial output image using the generator neural network, at block 1108, the projection subsystem computes a score using the discriminator neural network, at 1110, the image editing system computes a loss as a function of the score computed at 1108, and at 1112-1114, the image editing system minimizes the loss as a function of the computed score to update the latent space representation of the input image, as described above.

In some embodiments, at 1116, the image editing system processes the updated latent space representation with the generator neural network to generate an output image in the target domain. This may be performed in a similar fashion as described above with respect to block 308 of FIG. 3A. The image generated using the updated latent space representation will be constrained towards the domain used to train the generator neural network and the discriminator neural network. For example, using StyleGAN (supra), the output image will be constrained to the domain of photorealistic images of faces (the target domain in this example). Examples of images projected towards realism in this fashion are illustrated in FIGS. 12-14. Alternatively, or additionally, the target domain may correspond to anime style, impressionist painting style, etc., when using a GAN trained on the domain of interest. Applications of the process 1100 include using a collage to generate a photorealistic output image of a face (as described below with respect to FIG. 12). Other applications include generating a cartoon from a photograph, generating a photorealistic landscape from a landscape painting, and various other applications of taking input from one domain and projecting it onto another domain.

In some implementations (e.g., before 1102), the training subsystem of the image editing system trains the encoder neural network. For example, the training subsystem trains the encoder neural network on randomly-generated synthetic images mapped from a Gaussian distribution. Improved domain to domain projection is obtained when the encoder has been trained on synthetic data. For example, the encoder is trained to project images to the StyleGAN latent space by training the encoder on randomly generated synthetic images G_synthesis(G_mapping(z)), where z is a Gaussian random variable. In some aspects, the Gaussian distribution is truncated. For example, the training subsystem uses a Gaussian distribution truncated at a value between 0.6 and 0.8 (e.g., truncated at 0.7).
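
A minimal sketch of generating one synthetic training pair is shown below; it applies truncation as the StyleGAN-style interpolation towards the mean latent, which is one possible reading of the truncation value mentioned above, and the 512-dimensional latent is an assumption.

```python
import torch

def make_synthetic_training_pair(g_mapping, g_synthesis, w_mean, truncation=0.7):
    """Generate one (image, latent) pair for encoder training from a truncated
    Gaussian sample, i.e. an image G_synthesis(G_mapping(z))."""
    z = torch.randn(1, 512)                 # Gaussian random variable z
    w = g_mapping(z)                        # map to W space
    w = w_mean + truncation * (w - w_mean)  # truncation trick (e.g., 0.7)
    image = g_synthesis(w)                  # synthetic training image
    return image, w
```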

As described above with respect to FIGS. 3A-3B, the optimization techniques of the present disclosure can provide results very quickly (e.g., in about 8 seconds), while maintaining high resolution (e.g., 1024×1024 pixels) and aesthetically pleasing output. Applying these techniques to domain to domain projection using the discriminator loss provides a way to project an image from one domain to another quickly and at high resolution.

Although FIG. 11 shows example blocks of process 1100, in some implementations, process 1100 may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in FIG. 11. Additionally, or alternatively, two or more of the blocks of process 1100 may be performed in parallel.

Example Results—Domain to Domain Projection

FIG. 12 depicts examples of images illustrating a photo collage 1202 which is merged to generate a photorealistic image 1212 using the techniques of FIG. 11. In some implementations, the image editing system provides a user interface for generating a collage of facial features 1202 (e.g., as part of the editor interface 104 of FIG. 1).

For example, the image editing system displays a user interface. The image editing system receives input from a user to generate a collage using a set of initial images. The editor interface may provide upload elements configured to accept user input to upload a set of images. The editor interface may further provide editing elements configured to receive user input to cut and paste the images to create a photo collage. As shown in FIG. 12, the top portion of the head 1204 is from one image, the eyes 1206 are from another image, the middle portion of the head 1208 is from another image, and the mouth and chin 1210 are from yet another image. The user can interact with the editor interface to cut and arrange the images to generate the collage 1202.

Once the collage 1202 has been configured, the collage 1202 is passed as input for the processing of FIG. 11. The collage is encoded and optimized before generating an image using a generator neural network. The processing of FIG. 11 is used to enforce realism in the optimization process. In the example illustrated in FIG. 12, the output image 1212 is a photorealistic image generated from the collage 1202.

The collage feature can be useful for generating a photorealistic face using a combination of facial features, as shown in FIG. 12. Other useful applications of the collage feature include combining home decor elements to blend into a photorealistic image for use in interior design or landscaping, or combining clothing items to blend into a photorealistic image of an outfit. In other examples, the domain constraint of FIG. 11 may target another domain, other than realism (e.g., a cartoon-like style), and the collage can be processed to generate a cartoon-like image, a sketch-like image, and so forth.

FIG. 13 depicts examples of images illustrating using a sketch to generate a more photorealistic image using the techniques of FIG. 11. In some applications, a sketch 1302 is the input image obtained at block 1102 of FIG. 11. The output image generated at block 1116 of FIG. 11 is a more photorealistic image of a face 1304. Accordingly, as shown in FIG. 13, the projection techniques of FIG. 11 can be used to make an artistic sketch look more like a photorealistic face.

FIG. 14 depicts examples of images illustrating using a three-dimensional (3D) drawing to generate a photorealistic image using the techniques of FIG. 11 according to certain embodiments of the present disclosure. In some applications, a 3D drawing 1402 is the input image obtained at block 1102 of FIG. 11. The output image generated at block 1116 of FIG. 11 looks more like a photorealistic image 1404. Accordingly, as shown in FIG. 14, the projection techniques of FIG. 11 can be used to make a 3D drawing look more like a photorealistic face.

Example of a Computing System for GAN Based Image Processing

Any suitable computing system or group of computing systems can be used for performing the operations described herein. For example, FIG. 15 depicts an example of a computing system 1500 that executes an image editing system 102 that includes an edit management subsystem 120 for performing image processing as described herein. In some embodiments, the computing system 1500 also executes a projection subsystem 110 for performing latent space projection as described herein, an image generation subsystem 130 for performing image generation as described herein, a training subsystem 140 for performing machine learning model training as described herein, and an editor interface 104 for controlling input and output to configure image edits as described herein. In other embodiments, a separate computing system having devices similar to those depicted in FIG. 15 (e.g., a processor, a memory, etc.) executes one or more of the subsystems 110-140 and the editor interface 104.

The depicted example of a computing system 1500 includes a processor 1502 communicatively coupled to one or more memory devices 1504. The processor 1502 executes computer-executable program code stored in a memory device 1504, accesses information stored in the memory device 1504, or both. Examples of the processor 1502 include a microprocessor, an application-specific integrated circuit (“ASIC”), a field-programmable gate array (“FPGA”), or any other suitable processing device. The processor 1502 can include any number of processing devices, including a single processing device.

The memory device 1504 includes any suitable non-transitory computer-readable medium for storing data, program code, or both. A computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable instructions or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, a memory chip, a ROM, a RAM, an ASIC, optical storage, magnetic tape or other magnetic storage, or any other medium from which a processing device can read instructions. The instructions may include processor-specific instructions generated by a compiler or an interpreter from code written in any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, and ActionScript.

The computing system 1500 may also include a number of external or internal devices, such as input or output devices. For example, the computing system 1500 is shown with one or more input/output (“I/O”) interfaces 1508. An I/O interface 1508 can receive input from input devices or provide output to output devices. One or more buses 1506 are also included in the computing system 1500. The bus 1506 communicatively couples one or more components of the computing system 1500.

The computing system 1500 executes program code that configures the processor 1502 to perform one or more of the operations described herein. The program code includes, for example, the image editing system 102, including the projection subsystem 110, the edit management subsystem 120, the image generation subsystem 130, the training subsystem 140, the editor interface 104, or other suitable applications that perform one or more operations described herein. The program code may be resident in the memory device 1504 or any suitable computer-readable medium and may be executed by the processor 1502 or any other suitable processor. In some embodiments, the projection subsystem 110, the edit management subsystem 120, the image generation subsystem 130, the training subsystem 140, and the editor interface 104 are stored in the memory device 1504, as depicted in FIG. 15. In additional or alternative embodiments, one or more of the projection subsystem 110, the edit management subsystem 120, the image generation subsystem 130, the training subsystem 140, and the editor interface 104 are stored in different memory devices of different computing systems. In additional or alternative embodiments, the program code described above is stored in one or more other memory devices accessible via a data network.

The computing system 1500 can access data in any suitable manner. In some embodiments, some or all of one or more of these data sets, models, and functions are stored in the memory device 1504, as in the example depicted in FIG. 15. For example, a computing system 1500 that executes the training subsystem 140 can access training data stored by an external system.

In additional or alternative embodiments, one or more of these data sets, models, and functions are stored in the same memory device (e.g., one of the memory devices 1504). For example, a common computing system can host the edit management subsystem 120 and the training subsystem 140 as well as the training data. In additional or alternative embodiments, one or more of the programs, data sets, models, and functions described herein are stored in one or more other memory devices accessible via a data network.

The computing system 1500 also includes a network interface device 1510. The network interface device 1510 includes any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks. Non-limiting examples of the network interface device 1510 include an Ethernet network adapter, a modem, and the like. The computing system 1500 is able to communicate with one or more other computing devices (e.g., a computing device executing the editor interface 104 as depicted in FIG. 1) via a data network using the network interface device 1510.

In some embodiments, the functionality provided by the computing system 1500 may be offered via a cloud-based service provided by a cloud infrastructure 1600 operated by a cloud service provider. For example, FIG. 16 depicts an example of a cloud infrastructure 1600 offering one or more services, including image editing software-as-a-service 1604 that offers image editing functionality as described in this disclosure. Such a service can be subscribed to and used by a number of user subscribers using user devices 1610A, 1610B, and 1610C across a network 1608. The service may be offered under a Software as a Service (SaaS) model. One or more users may subscribe to such a service.

In the embodiment depicted in FIG. 16, the cloud infrastructure 1600 includes one or more server computer(s) 1602 that are configured to perform processing for providing one or more services offered by the cloud service provider. One or more of the server computer(s) 1602 may implement a projection subsystem 110, edit management subsystem 120, image generation subsystem 130, and training subsystem 140, as depicted in FIG. 15. The subsystems 110-140 may be implemented using software only (e.g., code, program, or instructions executable by one or more processors provided by cloud infrastructure 1600), in hardware, or combinations thereof. For example, one or more of the server computer(s) 1602 may execute software to implement the services and functionalities provided by subsystems 110-140, where the software, when executed by one or more processors of the server computer(s) 1602, causes the services and functionalities to be provided.

The code, program, or instructions may be stored on any suitable non-transitory computer-readable medium such as any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable instructions or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, a memory chip, a ROM, a RAM, an ASIC, optical storage, magnetic tape or other magnetic storage, or any other medium from which a processing device can read instructions. The instructions may include processor-specific instructions generated by a compiler or an interpreter from code written in any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, and ActionScript. In various examples, the server computer(s) 1602 can include volatile memory, non-volatile memory, or a combination thereof.

In the embodiment depicted in FIG. 16, cloud infrastructure 1600 also includes a network interface device 1606 that enables communications to and from cloud infrastructure 1600. In certain embodiments, the network interface device 1606 includes any device or group of devices suitable for establishing a wired or wireless data connection to the network 1608. Non-limiting examples of the network interface device 1606 include an Ethernet network adapter, a modem, and/or the like. The cloud infrastructure 1600 is able to communicate with the user devices 1610A, 1610B, and 1610C via the network 1608 using the network interface device 1606.

An editor interface (e.g., editor interface 104A, editor interface 104B, and editor interface 104C) may be displayed on each of the user devices: user device A 1610A, user device B 1610B, and user device C 1610C. A user of user device 1610A may interact with the displayed editor interface, for example, to enter an input image and/or image modification parameters. In response, image processing may be performed by the server computer(s) 1602.

GENERAL CONSIDERATIONS

Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, those skilled in the art will understand that the claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.

Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.

The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multi-purpose microprocessor-based computer systems accessing stored software that programs or configures the computing system from a general purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.

Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied; for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.

The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.

While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude the inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.

1. A computer-implemented method comprising: producing an initial latent space representation of an input image by encoding the input image; generating, by a generator neural network, an initial output image by processing the initial latent space representation of the input image; generating an optimized latent space representation of the input image using a loss minimization technique that minimizes a loss between the input image and the initial output image, wherein the loss is based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image; and outputting the optimized latent space representation of the input image for downstream use.
2. The method of claim 1, further comprising downsampling the input image before generating the initial latent space representation of the input image.
3. The method of claim 2, further comprising computing the loss by: downsampling the initial output image; passing the downsampled initial output image as input to a convolutional neural network and extracting the initial perceptual features as output from a subset of layers of the convolutional neural network; passing the downsampled input image as input to the convolutional neural network and extracting the target perceptual features from the subset of the layers of the convolutional neural network; and computing the loss based upon the target perceptual features and the initial perceptual features.
4. The method of claim 3, wherein the convolutional neural network is a Visual Geometry Group (VGG) network, and wherein the subset of the layers include a conv1_1 layer, a conv1_2 layer, a conv3_1 layer, and a conv4_1 layer of the VGG network.
5. The method of claim 1, wherein the loss is further based on one or more of: a comparison of pixels of the input image and pixels of the initial output image; or a comparison of the initial latent space representation and a target latent code.
6. The method of claim 1, the downstream use comprising one or more of: applying user-configured edits to the latent space representation of the input image; or generating an output image, by the generator neural network, by processing the optimized latent space representation, wherein the output image is perceptually similar to the input image.
7. The method of claim 6, wherein the producing the initial latent space representation, optimizing the initial latent space representation, and generating the output image that is perceptually similar to the input image are performed in less than about 10 seconds.
8. The method of claim 7, wherein the output image has a resolution of about 1024×1024 pixels.
9. The method of claim 6, further comprising: outputting the output image for display on a computing device.
10. A computing system comprising: a processor; a non-transitory computer-readable medium comprising instructions which, when executed by the processor, perform processing comprising: producing an initial latent space representation of an input image by encoding the input image; generating, by a generator neural network, an initial output image by processing the initial latent space representation of the input image; generating an optimized latent space representation of the input image using a loss minimization technique that minimizes a loss between the input image and the initial output image, wherein the loss is based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image; and outputting the optimized latent space representation of the input image for downstream use.
11. The computing system of claim 10, the processing further comprising downsampling the input image before generating the initial latent space representation of the input image.
12. The computing system of claim 11, the processing further comprising computing the loss by: downsampling the initial output image; passing the downsampled initial output image as input to a convolutional neural network and extracting the initial perceptual features as output from a subset of layers of the convolutional neural network; passing the downsampled input image as input to the convolutional neural network and extracting the target perceptual features from the subset of the layers of the convolutional neural network; and computing the loss based upon the target perceptual features and the initial perceptual features.
13. The computing system of claim 12, wherein the convolutional neural network is a Visual Geometry Group (VGG) network, and wherein the subset of the layers include a conv1_1 layer, a conv1_2 layer, a conv3_1 layer, and a conv4_1 layer of the VGG network.
14. The computing system of claim 10, wherein the loss is further based on one or more of: a comparison of pixels of the input image and pixels of the initial output image; or a comparison of the initial latent space representation and a target latent code.
15. The computing system of claim 10, the downstream use comprising one or more of: applying user-configured edits to the latent space representation of the input image; or generating an output image, by the generator neural network, by processing the optimized latent space representation, wherein the output image is perceptually similar to the input image.
16. The computing system of claim 15, wherein the downsampling, generating the initial latent space representation, optimizing the initial latent space representation, and generating the output image that is perceptually similar to the input image are performed in less than about 10 seconds.
17. The computing system of claim 16, wherein the output image has a resolution of about 1024×1024 pixels.
18. The computing system of claim 15, the processing further comprising: outputting the output image via a display of a computing device.
19. A non-transitory computer-readable medium having instructions stored thereon, the instructions executable by a processing device to perform operations comprising: producing an initial latent space representation of an input image by encoding the input image; a step for generating an optimized latent space representation of the input image based on target perceptual features extracted from the input image and initial perceptual features extracted from the initial output image; and outputting the optimized latent space representation of the input image for downstream use.
20. The medium of claim 19, the downstream use comprising one or more of: applying user-configured edits to the latent space representation of the input image; or generating an output image, by the generator neural network, by processing the optimized latent space representation, wherein the output image is perceptually similar to the input image.